What is the purpose of the following excerpt from a paper using metagenomic assemblies:
Redundancies of sequences from the same organism within the metagenome were removed by clustering all contigs at 95% identity with CD-hit v4.6.6 (72), and only the longest contig per cluster was kept
I understand what is being done, but I don't know why. I could perhaps see a use for it before binning, to reduce computation time, but otherwise I am not sure. Would it reduce annotation time or something similar?
The paper in question (see the Methods section "Metagenome sequencing and assembly"): https://journals.asm.org/doi/10.1128/mSphere.00165-19
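For reference, the quoted step amounts to a greedy, length-sorted clustering in which only the longest contig of each cluster survives. Below is a minimal toy sketch of that idea, not the paper's pipeline and not CD-HIT itself. The function names (`kmers`, `containment`, `dereplicate`) and the contig names are hypothetical, and the k-mer containment score is only a crude stand-in, assumed here for brevity, for the alignment-based identity a real tool computes.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(short, long_, k=4):
    """Fraction of the shorter sequence's k-mers also present in the longer one.

    Assumption: a rough proxy for sequence identity, used only to keep the
    example self-contained.
    """
    short_kmers = kmers(short, k)
    if not short_kmers:
        return 0.0
    return len(short_kmers & kmers(long_, k)) / len(short_kmers)


def dereplicate(contigs, threshold=0.95, k=4):
    """Greedy clustering: longest contigs become cluster representatives.

    contigs: dict mapping contig name -> sequence.
    Returns a dict mapping representative name -> list of member names.
    """
    # Process contigs from longest to shortest, so each representative is
    # always the longest member of its cluster.
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    clusters = {}
    for name, seq in ordered:
        for rep in clusters:
            if containment(seq, contigs[rep], k) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no match: this contig starts a new cluster
    return clusters


if __name__ == "__main__":
    long_contig = "ACGTTGCAAGCTTAGGCTAACG"
    example = {
        "contig_long": long_contig,
        "contig_fragment": long_contig[4:18],  # fully contained in contig_long
        "contig_other": "GGGGGGGGCCCCCCCC",    # unrelated sequence
    }
    for rep, members in dereplicate(example).items():
        print(rep, members)
```

Only the keys of the returned dict (the representatives) would be kept downstream. Everything else folded into a cluster is discarded, which is exactly where a threshold such as 95% decides how much real strain-level variation gets collapsed.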
95% similarity sounds to me like they're removing real diversity from the assembly, not "same organism redundancy"
This was my thought too. At a minimum, you would cluster at 97%, would you not?
I don't think any metagenome assembler (be it OLC or de Bruijn graph-based) would produce such redundancy anyway.