Question

Merging BAM vs concatenate FASTQ

0

4.3 years ago

nhaus ▴ 420

Hi,

I am fairly new to bioinformatics (genomics, to be specific), so excuse me if this is a straightforward question.

I have performed paired-end WES that was sequenced across two lanes, so I have two FASTQ files per lane (four in total).

I know that I can merge BAM files after aligning each fastq like this:


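# ref.fa: placeholder for the bwa-indexed reference; samtools merge expects sorted inputs
bwa mem ref.fa lane1_R1.fq lane1_R2.fq | samtools sort -o lane1.bam -

bwa mem ref.fa lane2_R1.fq lane2_R2.fq | samtools sort -o lane2.bam -

samtools merge merged.bam lane1.bam lane2.bam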

If you don't care about read groups or potential batch effects, is it also possible to simply concatenate lane1_R1 with lane2_R1 (and likewise for R2) and then do the alignment? Something like this:


cat lane1_R1.fq lane2_R1.fq > WES_R1.fq

cat lane1_R2.fq lane2_R2.fq > WES_R2.fq

bwa mem ref.fa WES_R1.fq WES_R2.fq   # ref.fa: placeholder for the indexed reference

If anyone could tell me what the "best practice" for this is, I'd be very thankful!

Cheers!

sequencing genome • 3.0k views

ADD COMMENT • link 4.3 years ago by nhaus ▴ 420

0

Good question! I don't think there is a hard rule about what you have to do. I like to concatenate all my R1 files and all my R2 files before alignment, but aligning each lane separately and merging the BAM files afterwards should be fine too!

ADD REPLY • link 4.3 years ago by brunobsouzaa ▴ 830

0

Cross-posted on reddit: https://www.reddit.com/r/bioinformatics/comments/idum3k/question_merging_bam_vs_concatenate_fastq/

What's up with that, OP?

ADD REPLY • link 4.3 years ago by Ram 44k

score 0 · Answer 1 · 2020-08-21

0

4.3 years ago

GenoMax 147k

It should be fine to merge the files before alignment. Processing individual sequence files can allow you to process your data in parallel in case you have access to a compute cluster.
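If you go the per-lane route and want to keep lane information, one possible approach is to give each lane its own read group at alignment time. The sketch below only prints the commands (one per lane, so they could be submitted as separate cluster jobs) rather than running them. The names ref.fa, sample1, and the lane*_R?.fq.gz files are placeholders, and the read-group fields shown are just a minimal example, not a required set.

```shell
#!/bin/sh
# Dry run: print one alignment command per lane, each with its own read group.
# REF, SAMPLE, and the lane file names are placeholders for illustration.
REF=ref.fa
SAMPLE=sample1

lane_cmd() {
    lane=$1
    # bwa mem -R takes the @RG header line with literal "\t" separators,
    # so the backslashes are escaped here to survive the double quotes.
    printf '%s\n' "bwa mem -R '@RG\\tID:${lane}\\tSM:${SAMPLE}\\tPL:ILLUMINA' ${REF} ${lane}_R1.fq.gz ${lane}_R2.fq.gz | samtools sort -o ${lane}.bam -"
}

for lane in lane1 lane2; do
    lane_cmd "$lane"
done

# Final step once both per-lane BAMs exist.
printf '%s\n' "samtools merge merged.bam lane1.bam lane2.bam"
```

Each printed line can be pasted into a job script. Since every lane's BAM carries its own @RG line, the merged file should still let you tell the lanes apart.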

ADD COMMENT • link 4.3 years ago by GenoMax 147k

0

If you merge, however, make sure you document that properly. I am currently dealing with legacy data, where someone seems to have merged different runs (different instruments, different library types). While this doesn't seem to have an effect on our processing, it is still very confusing.
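For data that has already been concatenated, a rough way to see what went into a file is to count reads per instrument, run, flowcell, and lane from the read headers. This is only a sketch: it assumes Illumina-style headers (instrument:run:flowcell:lane:... as the first four colon-separated fields) and strict four-line FASTQ records. The function name summarize_lanes and the demo reads are made up for illustration.

```shell
#!/bin/sh
# Count reads per instrument:run:flowcell:lane in a FASTQ stream on stdin.
# Assumes 4-line records and Illumina-style headers (an assumption, check your data).
summarize_lanes() {
    awk 'NR % 4 == 1 {
             split($1, f, ":")          # $1 is the header up to the first space
             sub(/^@/, "", f[1])        # drop the leading "@"
             n[f[1] ":" f[2] ":" f[3] ":" f[4]]++
         }
         END { for (k in n) print k "\t" n[k] }' | sort
}

# Demo with two made-up reads from the same flowcell but different lanes.
printf '@M1:7:FC1:1:1101:1:1 1:N:0:ACGT\nACGT\n+\nIIII\n@M1:7:FC1:2:1101:1:1 1:N:0:ACGT\nTTGA\n+\nIIII\n' | summarize_lanes
```

On real data you would pipe the file in, for example gzip -dc WES_R1.fq.gz | summarize_lanes, and note the result alongside the merged file.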

Also, I see that your data are uncompressed. "Best practice" would be to keep the FASTQ files gzip-compressed; you can even concatenate gzipped files (with cat) without decompressing them first.
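A small self-contained check of the gzip point, using two tiny made-up FASTQ records in a temporary directory (the file names mirror the ones in the question but are otherwise arbitrary). It works because a gzip file may consist of several compressed members back to back, and gzip -dc decompresses them as one stream.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)

# Two stand-in per-lane files, one 4-line record each.
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/lane1_R1.fq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/lane2_R1.fq.gz"

# Concatenate the compressed files directly, no decompression needed.
cat "$workdir/lane1_R1.fq.gz" "$workdir/lane2_R1.fq.gz" > "$workdir/WES_R1.fq.gz"

# The result decompresses to both records in order (8 lines in total).
gzip -dc "$workdir/WES_R1.fq.gz" | wc -l
```

Do the same for the R2 files, keeping the lanes in the same order as for R1 so the read pairs stay in sync.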