Hi everyone, I've started learning about shotgun metagenomic sequencing. My understanding is that there are two main approaches: mapping-based and assembly-based. In the former, you simply map reads to a reference database to classify them and perform some type of functional profiling. In the latter, you first assemble the reads into genomes and then do classification and functional profiling on those.
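To check my understanding, here's roughly how I picture the two workflows. This is only a sketch: the tool choices (Kraken2 for read classification, MEGAHIT for assembly) and all the file/database names are placeholders I've picked up from tutorials, not a tested pipeline.

```python
# Rough sketch of the two approaches as I understand them.
# Tool choices, paths and database names are placeholders, not recommendations.
import subprocess

reads_1, reads_2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# 1) Mapping/classification-based: assign each read against a reference database.
subprocess.run([
    "kraken2", "--db", "k2_standard_db",
    "--paired", reads_1, reads_2,
    "--report", "sample.kreport",
    "--output", "sample.kraken",
], check=True)

# 2) Assembly-based: assemble reads into contigs first, then classify/annotate those.
subprocess.run([
    "megahit", "-1", reads_1, "-2", reads_2,
    "-o", "sample_assembly",   # contigs end up in sample_assembly/final.contigs.fa
], check=True)
# ...followed by binning, classification and functional annotation of the contigs/bins.
```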
What's unclear to me is this: in the assembly approach, you might (given enough reads) be able to assemble a full genome of a hitherto unidentified species. But since this new species is not in the database, you wouldn't know what it is anyway. You could compare the sequence to see what it's similar to, but you can do that with the unassembled reads anyway.
So given the large computational resources required to assemble genomes, what additional value can be gained? Having an assembled genome is a 'nice to have' but if the goal is simply classification and functional profiling, then wouldn't a mapping-based approach be sufficient?
If the assembly is used for some sort of comparative genomics study, then that would be understandable. But how many shotgun metagenomics studies actually have a larger comparative genomics component? Not that many, from my survey of the literature. And even when there is one, the assembled genomes probably make up only a small fraction of all the genomes in the sample, so it's hardly a representative set.
I'm a newbie to this field, so please tell me if my reasoning is wrong!
All valid points! I forgot about the extra resolution that can be gained from contigs vs. reads alone. This does raise some other thoughts.
You would generally get longer contigs from abundant species. Longer contigs lead to better resolution, so abundant species get profiled better. In a well-characterized system such as the GI tract, those would be species of Bacteroides, Clostridium, etc. Okay, great: now you've got genomes of things that have already been studied in depth. But the interesting stuff may be in the rarer species, and since they're rare, you may not have enough reads to assemble a genome or even just longer contigs. Now you've wasted two days of cluster time getting genomes of things you already knew about, and you're no closer to discovering anything new about the novel species.
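To put rough numbers on why the rare species don't assemble (everything below is made up, just back-of-the-envelope coverage arithmetic):

```python
# Back-of-the-envelope: expected per-genome coverage from relative abundance.
# All numbers are invented for illustration.
total_reads = 20_000_000       # total reads in the run
read_len = 150                 # bp
genome_size = 5_000_000        # assume ~5 Mb for every genome, for simplicity

community = {                  # hypothetical relative abundances
    "Bacteroides sp.": 0.30,
    "Clostridium sp.": 0.15,
    "rare novel sp.":  0.001,
}

for name, abundance in community.items():
    coverage = total_reads * read_len * abundance / genome_size
    # ~10x is a (very rough) minimum for a usable short-read assembly
    verdict = "assemblable" if coverage >= 10 else "probably too sparse to assemble"
    print(f"{name:18s} ~{coverage:6.1f}x  -> {verdict}")
```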
In the best-case scenario, you've got enough reads to assemble the genome of an interesting novel species. You then do gene prediction and annotation and write a nice paper. But since the genome has not been deposited in a database, another scientist could de novo assemble, annotate and publish the same genome, and neither of you may be aware you're talking about the same species. This issue is not unique to metagenomics, but it's less of a problem with databases that are open and frequently updated. Granted, that in itself leads to some other problems.
I guess I'm still trying to convince myself of the value of shotgun metagenomics (over something like 16S rDNA profiling) given the time, money and computing resources required. It obviously depends on the research question, but I'm sure many projects in this space are simply some PI having more funding than he or she knows what to do with.
16S still has its place, for sure; you can't beat it for cost. But there's no way around the abundance issue: if it's a rare bacterium, you may well still miss it in a 16S study because you don't pick up enough starting material for amplification.
It's a bit of a false equivalence to compare 'reads only vs contigs vs amplicon'. 16S isn't quite the same as profiling WGS/NGS reads alone, not to mention that it comes with a whole slew of its own biases. Anything where you're starting from primers (however degenerate or broadly applicable they might be) means you're only going to catch the stuff that likes that 'bait'.
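To illustrate the 'bait' point, here's a toy check of whether a degenerate primer even matches a template at the priming site. The primer below is the commonly used 515F; the two template sequences are invented examples.

```python
# Toy illustration of primer bias: a template only gets amplified if every
# (possibly degenerate) primer base is compatible with the template base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "N": "ACGT",
}

def primer_matches(primer: str, site: str) -> bool:
    """True if every template base is allowed by the corresponding primer base."""
    return len(primer) == len(site) and all(
        t in IUPAC[p] for p, t in zip(primer, site)
    )

primer_515f = "GTGYCAGCMGCCGCGGTAA"
templates = {
    "well-behaved taxon": "GTGTCAGCAGCCGCGGTAA",  # matches despite the Y/M wobbles
    "divergent taxon":    "GTGTCAGCAGCTGCGGTAA",  # one mismatch -> likely missed
}
for name, site in templates.items():
    print(name, "->", "amplified" if primer_matches(primer_515f, site) else "missed")
```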
Longer contigs from the abundant organisms are all relative too. Yes, you'll probably get closer to a full, closed genome for those, but even a low-coverage set of reads can benefit from assembly, even if the assembly stats are weak. Unfortunately, though, you're right: there isn't a whole lot we can do to access this 'unsequenced majority' other than sequencing deeper and deeper, and you're always going to waste some reads on the common stuff.
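'Weak assembly stats' mostly just means a low N50 and the like; for what it's worth, here's the headline number as a toy calculation (contig lengths invented):

```python
# Toy N50: the contig length L such that contigs of length >= L account for
# at least half of the total assembly. Contig lengths below are invented.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

deep_coverage_asm = [1_200_000, 800_000, 350_000, 90_000, 40_000]
low_coverage_asm = [9_000, 7_500, 5_000, 3_200, 2_100, 1_500, 900, 700, 600, 500]

print("deep:", n50(deep_coverage_asm))  # 800,000
print("low: ", n50(low_coverage_asm))   # 7,500 - weak, but still more context than 150 bp reads
```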
The point about two scientists discovering the same genome is valid and probably happens all the time - but the answer is pretty simple: they should be depositing their data.
NGS is trendy, which certainly helps bring in money compared to something like 16S. Some researchers almost certainly do have more money than they know what to do with, and a lot of the sequencing that goes on is perhaps largely surplus to requirements. But how are the databases ever to improve if we aren't sequencing, and crucially depositing, more genomes - especially the rare ones?