Dear Community,
I have been working on estimating the relative abundance of selected bacterial strains in a metagenomic dataset derived from whole-genome sequencing with the Illumina TruSeq kit and a HiSeq sequencer. My plan is to do initial quality control with Trimmomatic, followed by mapping to reference bacterial genomes with Bowtie2. I have read that forward and reverse reads can be merged before mapping, using tools like PEAR.
Now I am wondering whether this merging step is recommended. Will it make the subsequent mapping more accurate?
Also, I am quantifying the reads mapped to each bacterial genome, so once I merge the reads and map them, I should treat merged and unmerged reads differently in the calculation. For example, one hit from a merged read should be counted twice, to be comparable with the two hits from an unmerged forward and reverse read. Am I right?
Thanks in advance!
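One way to think about the counting question is to count sequenced fragments rather than individual reads, so that a merged read and an unmerged pair each contribute one. Below is a minimal sketch of that idea in Python. The function names and the `(read_name, genome)` input format are hypothetical and chosen only for illustration; it also assumes both mates of a pair share the same read name, and it does not handle multi-mapping reads or genome-length normalisation.

```python
from collections import Counter


def count_fragments(alignments):
    """Count one fragment per merged read or per read pair.

    `alignments` is an iterable of (read_name, genome) tuples.
    An unmerged pair contributes two records with the same read_name;
    a merged read contributes one. Both end up counted once.
    """
    seen = set()
    counts = Counter()
    for read_name, genome in alignments:
        key = (read_name, genome)
        if key not in seen:  # second mate of a pair is skipped here
            seen.add(key)
            counts[genome] += 1
    return counts


def relative_abundance(counts):
    """Convert per-genome fragment counts to fractions of the total."""
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


# Hypothetical hits: frag1 is an unmerged pair (two records),
# frag2 and frag3 are merged reads (one record each).
hits = [("frag1", "A"), ("frag1", "A"), ("frag2", "A"), ("frag3", "B")]
counts = count_fragments(hits)
print(counts["A"], counts["B"])  # prints: 2 1
print(relative_abundance(counts))
```

Counting this way is equivalent to weighting a merged hit as two read hits, as proposed above, but it avoids having to track which reads were merged.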
Thanks for your valuable information!
I had been thinking: wouldn't longer reads be mapped to the reference more accurately?
Yes, but aligners also try to keep pairs together. So if read 1 could map to 5 locations, and read 2 could map to 3 locations, but there is only one location where both could map nearby, that is the site that will be selected. So there should not be much difference in sensitivity or specificity between paired reads and merged reads.
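The 5-locations-versus-3-locations example above can be sketched in a few lines of Python. This is only a toy illustration of the principle, not how Bowtie2 or any other aligner is actually implemented; the function name, the candidate positions, and the `max_insert` threshold are all made up for the example.

```python
def consistent_placements(read1_sites, read2_sites, max_insert=500):
    """Return (site1, site2) combinations where both mates map nearby.

    A placement is kept only if the two candidate positions lie
    within `max_insert` bases of each other.
    """
    return [
        (p1, p2)
        for p1 in read1_sites
        for p2 in read2_sites
        if abs(p1 - p2) <= max_insert
    ]


# Read 1 has 5 candidate locations, read 2 has 3 (hypothetical values).
read1_sites = [1_000, 52_000, 130_000, 410_000, 900_000]
read2_sites = [52_300, 275_000, 640_000]

print(consistent_placements(read1_sites, read2_sites))
# prints: [(52000, 52300)]
```

Although each read is ambiguous on its own, only one combination places both mates close together, which is why pairing recovers most of the specificity that a longer merged read would give.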
That sounds reasonable. Thanks for your explanation!
Hi Brian,
I am a little confused by your explanation. Could you please clarify it for me?
As I understand it, when two paired reads are merged, only their overlapping part is combined. If they are in innie orientation, only the end (arrowhead) parts overlap and are merged, while the longer tail parts stay the same. Aren't those tail parts of both reads what is used for mapping? If so, mapping would not be affected, would it? And if we keep the pairing information by merging into a longer read, will that increase mapping accuracy?
I could present this idea by the illustration below:
Thank you in advance for your ideas and suggestions!
A correctly merged read will map more accurately than either read1 or read2 alone, because it is longer. But when mapped as a pair, the accuracy should be similar whether merged or unmerged. Merging has the advantage of reducing the substitution error rate in the overlapping region, but it has the disadvantage of potentially introducing indels in false-positive merges. That's very rare with BBMerge, though.
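To illustrate why merging can reduce the substitution error rate in the overlap, here is a minimal Python sketch that resolves disagreements by keeping the base with the higher quality score. It is not the algorithm used by BBMerge or PEAR; the function name is hypothetical, and it assumes the two overlap segments are already aligned to each other (read 2 reverse-complemented), have equal length, and contain no indels.

```python
def consensus_overlap(bases1, quals1, bases2, quals2):
    """Merge two aligned overlap segments base by base.

    Where the reads agree, the shared base is kept. Where they
    disagree, the base with the higher Phred quality is kept
    (ties go to read 1).
    """
    merged = []
    for b1, q1, b2, q2 in zip(bases1, quals1, bases2, quals2):
        if b1 == b2:
            merged.append(b1)
        else:
            merged.append(b1 if q1 >= q2 else b2)
    return "".join(merged)


# Read 1 has a low-quality G at the third position; read 2 has a
# high-quality C there (hypothetical values).
overlap1, quals1 = "ACGTTA", [30, 30, 12, 30, 30, 30]
overlap2, quals2 = "ACCTTA", [30, 30, 35, 30, 30, 30]

print(consensus_overlap(overlap1, quals1, overlap2, quals2))
# prints: ACCTTA
```

The sketch only covers the substitution case. The indel risk mentioned above comes from choosing the wrong overlap offset in the first place, which this example takes as given.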
Thank you Brian!