Question

Concatenate multiple paired-end reads or interleaved which is best approach to align with BWA-mem2

0

Entering edit mode

2.8 years ago

ben@f ▴ 20

Hello biostars,

I'm doing genome alignment with BWA-mem2. I got the raw-reads datasets of ten samples stored in ten folders; each has multiple paired-end (PE) fastq files of Illumina short reads (len_seq= 151 bp) loaded to (4 lanes) two fastq per lan, in a total of eight fastq/sample.

My goal is to do the genome alignment of each individual (sample) to the reference genome of an avian specie and address some genetic of population parameters.

I saw that it is okay to concatenate multiple paired-end, even with different lanes into one fastq file.

So I concatenated all the fastq files of one of my samples and followed two concatenation approaches.

A- I concatenated all the paired-end reads into one PE file (R1, R2), then aligned with BWA-mem2 (Use precompiled binaries) in the PE mode.

B- concatenated all the PE fastq files into one fastq file and rerun BWA-mem2 in SE mode. The results are shown below.

A - Bwa-mem2 output

Time taken for main_mem function: 26817.94 Sec

IO times (sec) :
Reading IO time (reads) avg: 373.65, (373.65, 373.65)
Writing IO time (SAM) avg: 1362.58, (1362.58, 1362.58)
Reading IO time (Reference Genome) avg: 0.96, (0.96, 0.96)
Index read time avg: 6.61, (6.61, 6.61)

Overall time (sec) (Excluding Index reading time):
PROCESS() (Total compute time + (read + SAM) IO time) : 26809.99
MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 26809.28, (26809.28, 26809.28)

 SAM Processing time (sec):
--WORKER_SAM avg: 6663.39, (6663.39, 6663.39)

Kernels' compute time (sec):
Total kernel (smem+sal+bsw) time avg: 20103.30, (20103.30, 20103.30)
    SMEM compute avg: 13512.54, (13512.54, 13512.54)
    SAL compute avg: 1294.74, (1294.74, 1294.74)
    BSW time, avg: 4891.20, (4891.20, 4891.20)

Total allocs: 219754346 = 219754346 out total requests: 893438606, Rate: 0.25

B - results

Time taken for main_mem function: 42463.53 Sec

IO times (sec) :
Reading IO time (reads) avg: 716.52, (716.52, 716.52)
Writing IO time (SAM) avg: 2242.50, (2242.50, 2242.50)
Reading IO time (Reference Genome) avg: 0.89, (0.89, 0.89)
Index read time avg: 6.42, (6.42, 6.42)

Overall time (sec) (Excluding Index reading time):
PROCESS() (Total compute time + (read + SAM) IO time) : 42455.98
MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 42454.83, (42454.83, 42454.83)

 SAM Processing time (sec):
--WORKER_SAM avg: 1675.04, (1675.04, 1675.04)

Kernels' compute time (sec):
Total kernel (smem+sal+bsw) time avg: 40779.49, (40779.49, 40779.49)
    SMEM compute avg: 27730.66, (27730.66, 27730.66)
    SAL compute avg: 2570.63, (2570.63, 2570.63)
    BSW time, avg: 9678.29, (9678.29, 9678.29)

Total allocs: 439503376 = 439503376 out total requests: 1786877212, Rate: 0.25

I have some questions:

Regarding the alignment and downstream analysis, which is the better approach for me? Concatenate or interleaved the paired-end reads and how that possible for multiple paired-end reads
Regarding the BWA-mem2 output, is there any spurious issue I should be aware of?

I'm new to this and would appreciate any suggestions or recommendations regarding this case. Thank you in advance.

interleaved paired-end bwa-mem2 • 1.9k views

ADD COMMENT • link 2.8 years ago by ben@f ▴ 20

score 2 · Accepted Answer · 2022-10-12

A- I concatenated all the paired-end reads into one PE file (R1, R2), then aligned with BWA-mem2 (Use precompiled binaries) in the PE mode.

As long as you concatenated R1 files and R2 files for lane specific data in exactly the same order for both this is the approach to take.

B- concatenated all the PE fastq files into one fastq file and rerun BWA-mem2 in SE mode. The results are shown below.

This is not the correct way to do things though technically it is ok to treat PE data as two single end reads.. You are throwing away the advantage of having paired-reads which allows the aligner to estimate the insert size by considering how far apart the two reads align on the reference.

What you have shown as results are simply some stats about time taken by bwa to do various operations. This information is not useful to make any judgement about the actual output you got.

You should instead be looking at results of samtools idxstats to get some basic information about how well your alignments performed.