Hello guys, I am relatively new to metagenomic data analysis and to bioinformatics in general. To give you a brief idea of my issue, let me explain my data: it comes from ~60 samples of phage enrichments, with multiple biological replicates (3-5) per sample. The data is therefore not strictly "metagenomic", because the samples have been enriched once, but the phages in the samples are not known, and since they derive from an environmental sample there should be more than just one phage in each. Host DNA was also removed in the extraction step.
I have already finished QC, trimming and assembly, and I am now left with a lot of contigs in each sample and am unsure how to proceed. I think the next step would be to bin my contigs, but I have already tried multiple programs such as CD-HIT and MetaBAT2 and the results do not really convince me. CD-HIT appeared a bit too "easy" because of its heuristic approach. With MetaBAT2 I get around 1-2 bins containing only one contig, which is mostly classified as complete by CheckV, but the rest of the bins contain 2-7 contigs each and I am unsure what to do with these. Before binning I removed all contigs <1 kb.
The goal is to know which phages are in the samples (I would have used BLASTn for this?), how abundant they are, and how this differs between the sample groups.
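To make the abundance part of the goal concrete, here is a minimal Python sketch of one common way to do it, assuming reads from each sample have already been mapped back to the contigs and counted per contig. The function names, the TPM-style normalization and the idea of averaging over replicates are illustrative assumptions, not something prescribed by any of the tools mentioned above.

```python
from collections import defaultdict


def tpm(read_counts, lengths):
    """Length-normalized relative abundance (TPM-style) for one sample.

    read_counts: {contig_id: mapped read count} (hypothetical input)
    lengths:     {contig_id: contig length in bp}
    """
    # Reads per base, so that long contigs do not look more abundant
    rates = {c: read_counts[c] / lengths[c] for c in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {c: 0.0 for c in rates}
    # Scale so that the values of one sample sum to one million
    return {c: r / total * 1e6 for c, r in rates.items()}


def group_means(per_sample, sample_to_group):
    """Average the per-sample abundances over the replicates of each group."""
    sums = defaultdict(lambda: defaultdict(float))
    n_samples = defaultdict(int)
    for sample, abundances in per_sample.items():
        group = sample_to_group[sample]
        n_samples[group] += 1
        for contig, value in abundances.items():
            sums[group][contig] += value
    return {
        group: {c: s / n_samples[group] for c, s in contigs.items()}
        for group, contigs in sums.items()
    }
```

Calling `tpm` once per sample and passing the results to `group_means` gives one mean abundance per contig (or vOTU) and sample group, which can then be compared between groups.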
Another question is the k-mer size in metaSPAdes, which I used for the assembly. In most programs I kept the default settings because I am not very experienced, but I noticed that SPAdes automatically assembled only with k=21,33,55. I read online that some people also assemble with k=77; I tried this, but the results were mostly the same.
I really appreciate any help; let me know if there is more information I need to give you.
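For reference, this is roughly how the k-mer list can be set explicitly instead of relying on the automatic choice. It is a small Python sketch that only builds the command line; the helper name and the file paths are hypothetical, and the checks reflect my understanding of the SPAdes manual (k values must be odd, below 128 and in ascending order).

```python
def metaspades_command(r1, r2, outdir, kmers=(21, 33, 55, 77)):
    """Build a metaSPAdes command with an explicit k-mer list.

    r1, r2 and outdir are placeholder paths supplied by the caller.
    """
    ks = list(kmers)
    # Assumed SPAdes constraints: odd values, smaller than 128
    if any(k % 2 == 0 or k >= 128 for k in ks):
        raise ValueError("k-mer sizes must be odd and smaller than 128")
    # Assumed SPAdes constraint: strictly ascending order
    if ks != sorted(set(ks)):
        raise ValueError("k-mer sizes must be listed in ascending order")
    return [
        "spades.py", "--meta",
        "-1", r1,
        "-2", r2,
        "-k", ",".join(str(k) for k in ks),
        "-o", outdir,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where SPAdes is installed.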
Update - 17.11.23:
Hey guys, I am sorry to bother you so much, but I am still confused and not sure how to proceed. I have completed QC, trimming and assembly, and I have checked all my contigs with CheckV and VirSorter2. The biggest contigs (>20-30 kbp) in my 30 samples are mostly marked as complete, but I also have a lot of small contigs (up to 9000 in some samples).
My plan was to set a length cutoff of 5000 bp, assign taxonomy to the remaining contigs with BLASTn, and then manually check the samples and assign the contigs to vOTUs.
My attempts at binning the contigs have not really worked. MetaBAT sometimes gives me a bin with only one contig (the longest) and a second bin with random contigs from one sample, but it does not bin the 2nd, 3rd or 4th longest contigs, for example, although their CheckV quality is not bad. What also makes binning difficult is that my 30 samples are really just 6 samples with 5 biological replicates each, so I am not sure whether I should concatenate the contigs within these groups or continue with all samples separately. So far I have done QC, trimming and assembly for every sample separately.
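The length cutoff itself is simple to apply. Here is a minimal sketch in plain Python, assuming the contigs are in a standard FASTA file; the function names are made up for illustration, and a library such as Biopython or a tool such as seqkit would do the same job.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=5000):
    """Keep only contigs that are at least min_len bp long."""
    return [(h, s) for h, s in records if len(s) >= min_len]
```

With a file on disk this would be used as `filter_by_length(read_fasta(open("contigs.fasta")), min_len=5000)`, where `contigs.fasta` is a placeholder name.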
PS: Sorry for posting this as a reply first!
Take a look at:
https://github.com/mtisza1/Cenote-Taker2
https://serratus.io/