Question

How Should I Deal with Paired-End Shotgun Metagenomic Reads for DIAMOND Analysis?

0

3.5 years ago

ian.petersen ▴ 10

Hi,

I'm trying to work with some Illumina shotgun metagenomic reads (2x150bp). I've tried merging the forward and reverse reads with BBMerge and PEAR, but both tools merge only about 30% of the reads at most.

Would I be right in assuming that this is due to the shotgun shearing producing some larger inserts where the forward and reverse reads never actually overlap?
If this is the case, would there be any benefit to merging the reads before DIAMOND analysis, or would just processing Read 1 and Read 2 separately be preferred?

In a protocol for DIAMOND and MEGAN analysis here, the authors suggest merging paired-end reads using fastq-join (which I assume would give similar results to BBMerge and PEAR) and then concatenating the merged and unmerged reads together to ensure all of the data is retained.

What would be the benefit of merging the reads at all if they are just getting combined with the unmerged reads anyway before analysis (other than having a single input file for DIAMOND)?

Thanks,

Ian

diamond metagenomics megan paired-end • 2.5k views

updated 3.4 years ago by h.mon 35k • written 3.5 years ago by ian.petersen ▴ 10

1

Hi ian.petersen

Why don't you assemble the reads into contigs, without worrying too much about merging, and then run the DIAMOND/MEGAN pipeline? With longer sequences, you would greatly improve the taxonomic and functional classification.

3.5 years ago by andres.firrincieli 3.8k

score 1 · Answer 1 · 2021-07-07

this is due to the shotgun shearing producing some larger inserts

Yes, shotgun libraries will result in a range of insert sizes, so reads from the shorter inserts will merge, and reads from the longer inserts won't merge.
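To make the arithmetic concrete, here is a minimal Python sketch of when a 2x150bp pair can overlap at all. The `min_overlap` threshold of 10 bases and the example insert sizes are assumptions for illustration only; BBMerge and PEAR apply their own overlap and quality criteria, so this approximates the geometry and does not reproduce either tool.

```python
def overlap_length(insert_size: int, read_len: int = 150) -> int:
    """Number of bases covered by both mates of a pair.

    A value of zero or below means the mates cannot overlap. For inserts
    shorter than one read, the overlap is capped at the insert itself.
    """
    return min(insert_size, 2 * read_len - insert_size)


def can_merge(insert_size: int, read_len: int = 150, min_overlap: int = 10) -> bool:
    """True if the mates share at least min_overlap bases (assumed threshold)."""
    return overlap_length(insert_size, read_len) >= min_overlap


def mergeable_fraction(insert_sizes, read_len: int = 150, min_overlap: int = 10) -> float:
    """Fraction of pairs whose mates overlap enough to be merged."""
    sizes = list(insert_sizes)
    if not sizes:
        return 0.0
    merged = sum(can_merge(s, read_len, min_overlap) for s in sizes)
    return merged / len(sizes)


# Hypothetical insert sizes, not measured from any real library.
example_inserts = [200, 250, 295, 320, 400, 500]
print(mergeable_fraction(example_inserts))
```

With 150bp reads, any insert longer than roughly 290bp leaves too little overlap under this assumed threshold, so a library with a broad insert-size distribution can easily end up with most pairs unmerged.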

What would be the benefit of merging the reads

I guess longer reads will give more precise similarity search results, since they can produce longer, better alignments and reduce the effect of short, spurious hits. However, it is strange that the protocol just concatenates the merged and unmerged reads, as this gives more weight to the unmerged pairs (each contributes two reads, so the same fragment can be counted twice) than to the merged pairs (each contributes one read). In practice, it shouldn't matter much, but I could see this leading to biases, as differences in genome characteristics (e.g., GC content) could lead to systematic differences in insert sizes between organisms.
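One way to see the double-counting effect is to compare per-read and per-fragment tallies. The sketch below is hypothetical: it assumes hits have already been reduced to `(read_id, taxon)` tuples and that unmerged mates carry `/1` and `/2` suffixes, neither of which is guaranteed by DIAMOND or MEGAN output.

```python
from collections import Counter


def fragment_id(read_id: str) -> str:
    """Strip a trailing /1 or /2 mate suffix (assumed naming scheme)."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def count_per_read(hits):
    """Naive tally: every read with a hit counts once."""
    return Counter(taxon for _, taxon in hits)


def count_per_fragment(hits):
    """Tally each (fragment, taxon) combination once, so both mates of an
    unmerged pair hitting the same taxon count as a single fragment."""
    seen = {(fragment_id(read_id), taxon) for read_id, taxon in hits}
    return Counter(taxon for _, taxon in seen)


# Made-up hits: one merged read and one unmerged pair for taxon A,
# plus a single mate for taxon B.
hits = [
    ("frag1", "A"),
    ("frag2/1", "A"),
    ("frag2/2", "A"),
    ("frag3/1", "B"),
]
print(count_per_read(hits))
print(count_per_fragment(hits))
```

In this toy input, taxon A is represented by two fragments but three reads, so a per-read count inflates it relative to a per-fragment count.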