Hi!
I'm testing the detection and abundance estimation of viruses in metagenomes (viromics). In Roux et al. (2017), the authors state that they used nucmer (from MUMmer) to cluster the contigs:
' Contigs from all samples were clustered with nucmer (Delcher, Salzberg & Phillippy, 2003) at ≥95% ANI across ≥80% of their lengths, as in (Brum et al., 2015; Gregory et al., 2016), to generate a pool of non-redundant “population contigs” '
Where I'm stuck:
- All my contigs from all samples (which are grouped by experimental condition) are in a single de novo assembly file (produced with MEGAHIT). There is no reference; the samples come from mouse gut. In theory, they contain several (probably unknown) genomes.
- nucmer requires two multi-FASTA inputs: a reference and a query.
What am I supposed to do?
- Merge a selection of viral genomes and use it as reference?
- Assemble the samples/groups separately and then use one assembly as reference?
- Use the same assembly file as both reference and query?
- Split the assembly file then use one (maybe the largest) contig as the reference?
- Anything else?
Alternatively, I've already performed clustering with CD-HIT. Would nucmer be better at clustering? I can only answer that once I manage to run nucmer.
If anyone has good experience with another viromics pipeline, I would be happy to test it. Any help will be very much appreciated. Thanks!
Based on this post, I guess the solution is to align my contig assembly file against itself, i.e. to use the same file as both reference and query.
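For what it's worth, here is a minimal Python sketch of how the self-alignment could be turned into clusters. It is not the method from Roux et al. (2017), just one plausible reading of "≥95% ANI across ≥80% of their lengths". Assumptions: the file names in the comments are hypothetical; the input is tab-delimited show-coords output produced with -c -l -T -H (13 columns, contig names in the last two); the 80% is measured on the shorter contig; each contig also appears as a hit against itself, so singletons are kept; and the greedy, longest-first assignment is my own choice. Please check the MUMmer flags against your installed version.

```python
from collections import defaultdict

# Assumed upstream commands (hypothetical file names):
#   nucmer --maxmatch --nosimplify -p self contigs.fa contigs.fa
#   show-coords -r -c -l -T -H self.delta > self.coords
# Assumed columns: S1 E1 S2 E2 LEN1 LEN2 %IDY LENR LENQ COVR COVQ REF QRY


def parse_coords(lines):
    """Yield (ref, qry, q_start, q_end, identity, ref_len, qry_len) per alignment."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or malformed lines
        s2, e2 = int(fields[2]), int(fields[3])
        # Reverse-strand hits have S2 > E2, so normalise the interval.
        yield (fields[11], fields[12], min(s2, e2), max(s2, e2),
               float(fields[6]), int(fields[7]), int(fields[8]))


def covered_bases(intervals):
    """Count bases covered by 1-based inclusive intervals, merging overlaps."""
    total, cur_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, cur_end + 1)
        if end >= start:
            total += end - start + 1
            cur_end = end
    return total


def cluster_contigs(lines, min_idy=95.0, min_cov=0.80):
    """Greedy clustering: the longest unassigned contig becomes a representative."""
    lengths = {}
    hits = defaultdict(list)  # (ref, qry) -> aligned intervals on the query
    for ref, qry, qs, qe, idy, rlen, qlen in parse_coords(lines):
        lengths[ref], lengths[qry] = rlen, qlen
        if ref != qry and idy >= min_idy:
            hits[(ref, qry)].append((qs, qe))

    linked = defaultdict(set)  # longer contig -> shorter contigs it covers
    for (ref, qry), intervals in hits.items():
        shorter_is_query = lengths[qry] <= lengths[ref]
        if shorter_is_query and covered_bases(intervals) >= min_cov * lengths[qry]:
            linked[ref].add(qry)

    clusters, assigned = {}, set()
    for contig in sorted(lengths, key=lambda c: (-lengths[c], c)):
        if contig in assigned:
            continue
        members = [contig] + sorted(linked[contig] - assigned)
        assigned.update(members)
        clusters[contig] = members
    return clusters
```

Usage would be something like `cluster_contigs(open("self.coords"))`, which returns a dict mapping each representative contig to the list of contigs in its cluster. The representatives would then be the non-redundant "population contigs".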