Hi folks,
I have around 60 metagenomic read sets from 60 different individuals (one sample per person). I want to find areas of differential mapping between these samples. In other words, do samples A and B share similar mapping patterns to contig A? I have tried assembling the samples separately, then combining the assemblies and deduplicating the combined assembly using dedupe.sh from BBMap. This still left a lot of similar contigs, and the duplicated sequences negatively impacted mapping quality. I can play around with increasing the minimum sequence identity for the deduplication step, but the extent to which I needed to deduplicate removed a good amount of sequence, which could confound our analysis.
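To make "similar mapping patterns" concrete, here is a minimal sketch of one possible comparison: correlating per-window read depth along a single contig between pairs of samples. This is only an illustration, not a method anyone in this thread has used. The depth values, the sample names, and the `pearson` helper are all hypothetical, and it assumes per-window depths have already been extracted from the alignments by some other tool.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length depth profiles.

    Pearson correlation is unaffected by a constant scaling of either
    profile, so overall sequencing depth does not need to match.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2 windows)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A completely flat profile has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical mean read depth in six consecutive windows of one contig.
sample_a = [10, 12, 11, 0, 0, 9]
sample_b = [20, 25, 21, 1, 0, 19]   # same shape as A, roughly twice the depth
sample_c = [0, 1, 0, 15, 14, 1]     # covers the windows A and B do not

r_ab = pearson(sample_a, sample_b)
r_ac = pearson(sample_a, sample_c)
print(f"A vs B: {r_ab:.2f}")
print(f"A vs C: {r_ac:.2f}")
```

A high correlation suggests two samples map to the same regions of the contig, while a negative one suggests they cover different regions. Duplicated contigs in the reference would distort these profiles, since reads get split between near-identical copies.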
I have now started wondering whether co-assembling our samples would be a better approach than assembling them separately. My concern is that co-assembling possibly disparate samples would create assemblies based on artificial combinations of reads from different samples.
Please let me know if you folks have any advice!
Did you turn the following option on when you ran dedupe.sh? You may want to try CD-HIT for removing redundancy as well.
The cluster option in dedupe.sh keeps running out of memory, even when run on 9 relatively short (<20 kb) contigs with 70 Gb of RAM.
Depending on the size of your data there may be no way around that. You may run into the same issue with other software.
I think we should be able to assemble just fine with the samples that we have. My concern is that the assembly might not be representative.
While representativeness is an important issue, it sounds like the quality of the assemblies you are ending up with is a more pressing problem, since you seem to have duplication that you have not been able to address.
The duplication is on the level of contigs, which I am trying to address. Do you not think co-assembly would remove this issue?
That is difficult to answer. Your best bet may be to try that out.