I have conducted a simulation to determine the optimal tool for merging paired-end reads. Currently, I am evaluating PEAR.
The procedure involves generating two FASTQ files, each containing 153-base reads. These reads share identical sequences in the 50-base segments at both ends, while the three central bases are varied to facilitate the simulation. Specifically, for the R1 reads, the middle three bases are GGG, and for the R2 reads, they are AAA, as illustrated below.
(RACP) [mdb1c20@cyan51 combinatory_model]$ head simulated_R1_dif.fastq
@read_1 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_2 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!!"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_3 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
(RACP) [mdb1c20@cyan51 combinatory_model]$ head simulated_R2_dif.fastq
@read_1 R2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_2 R2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_3 R2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
In all R1 and R2 reads, the sequences are consistent as previously described. I employ a combinatorial method to generate all possible quality scores for these reads. The objective of this simulation is to observe the outcome of the reads merged using PEAR. Based on my understanding, I anticipate that in the merged reads, the base selected should be the one from either R1 or R2 that possesses the highest quality score. Importantly, I expect this process to result in the retention of the original bases from R1 or R2 based on their quality scores, rather than the introduction of a new variant. This expectation aligns with the principle that merged read quality should reflect the highest confidence base calls from the pair, without generating novel base variations not present in the original reads.
(RACP) [mdb1c20@cyan51 combinatory_model]$ head ./pear_data/pear_output.fastq.assembled.fastq
@read_1 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_2 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!!"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read_3 R1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why do I get T in the merged read when in R1 I have G and in R2 I have A?? I don't understand...
The code used for PEAR
pear \
-f ./simulated_R1_dif.fastq \
-r ./simulated_R2_dif.fastq \
-o ./pear_data/pear_output.fastq
NGmerge seems to do what is expected
R1 Base-Quality (Base 1) R2 Base-Quality (Base 1) Merged Base-Quality (Base 1) ... R1 Base-Quality (Base 3) R2 Base-Quality (Base 3) Merged Base-Quality (Base 3)
0 G-0 A-0 G-0 ... G-0 A-0 G-0
1 G-0 A-1 G-0 ... G-1 A-0 G-1
2 G-0 A-2 G-0 ... G-2 A-0 G-2
3 G-0 A-3 G-0 ... G-3 A-0 G-3
4 G-0 A-4 G-0 ... G-4 A-0 G-4
... ... ... ... ... ... ... ...
68916 G-40 A-36 G-40 ... G-36 A-40 G-36
68917 G-40 A-37 G-40 ... G-37 A-40 G-37
68918 G-40 A-38 G-40 ... G-38 A-40 G-38
68919 G-40 A-39 G-40 ... G-39 A-40 G-39
68920 G-40 A-40 G-40 ... G-40 A-40 G-40
Is PEAR allowing one error in the overlap?