Shotgun metagenomics: Why bother with de novo assembly?
5.8 years ago

Hi everyone, I've started learning about shotgun metagenomic sequencing. My understanding is that there are two main approaches: mapping-based and assembly-based. In the former, you simply map reads against a database to determine classification and perform some type of functional profiling. In the latter, you first assemble the reads into genomes and then do classification and functional profiling.

What's unclear to me is this: in the assembly approach, you might (given enough reads) be able to assemble a full genome of a hitherto unidentified species. But since this new species is not in the database, you wouldn't know what it is anyway. You could compare the sequence to see what it's similar to, but you can do that with the unassembled reads anyway.

So given the large computational resources required to assemble genomes, what additional value can be gained? Having an assembled genome is a 'nice to have' but if the goal is simply classification and functional profiling, then wouldn't a mapping-based approach be sufficient?

If the assembly is used to perform some sort of comparative genomics study, then that would be understandable. But how many shotgun metagenomics studies actually have a larger comparative genomics component? Not that many, from my survey of the literature. Even when they do, the assembled genomes probably make up only a small fraction of all the genomes in the sample, so it's hardly a representative set.

I'm a newbie to this field, so please tell me if my reasoning is wrong!

metagenomics assembly
5.8 years ago
Joe 21k

In this specific case you might well be right that there isn't anything super significant to be gained by profiling the metagenome above the read level, but I can think of a few reasons it's more than just a 'nice to have'.

Firstly, one might want to assemble to the contig level to get a better 'signal to noise' ratio, i.e. it's much easier to say unambiguously that contig X belongs to some organism than it is to say that a short read does, particularly if that read happens to span, say, a highly conserved gene. Similarly, having contigs, and therefore coverage statistics, is a legitimate sequence-binning technique in its own right. This is how the metagenomic binning tool CONCOCT works, for instance: it bins contigs according to their coverage, the underlying assumption being that the DNA of each organism was present at some particular level, so there should be a 'coverage signal' that corresponds to the original DNA abundance. You can't necessarily get this from reads alone, again especially if there are many similar reads.
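As a rough illustration of that coverage-signal idea (a minimal sketch only, not CONCOCT itself, which also uses k-mer composition and coverage across multiple samples), here's how you might cluster contigs by mean coverage. The depth.txt file name and the number of bins are placeholders:

```python
# Minimal sketch of coverage-based contig binning (illustrative only).
# Assumes a 'depth.txt' produced by: samtools depth -a aln.bam
# (columns: contig, position, depth) -- the file name is hypothetical.
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

depth_sum = defaultdict(int)
n_pos = defaultdict(int)

with open("depth.txt") as fh:
    for line in fh:
        contig, _pos, depth = line.rstrip("\n").split("\t")
        depth_sum[contig] += int(depth)
        n_pos[contig] += 1

contigs = sorted(depth_sum)
mean_cov = np.array([depth_sum[c] / n_pos[c] for c in contigs])

# Cluster contigs by log-coverage: contigs from the same organism should
# share a similar coverage level (the 'coverage signal').
X = np.log10(mean_cov + 1).reshape(-1, 1)
k = 5  # number of bins -- arbitrary for this sketch
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

for contig, cov, bin_id in zip(contigs, mean_cov, labels):
    print(f"{contig}\tmean_cov={cov:.1f}\tbin={bin_id}")
```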

Mainly, however, I think it's because just assigning the genomes is rarely the end goal; rather, it's one of the first steps in some larger analysis. If you wanted, for instance, to do some taxonomic relationship studies/phylogenetics, it would be much better to do this at the contig level.

Plus, if you found an entirely new genome that wasn't in the existing databases, that's not to say you can't go on to say useful things about it via annotation etc. A genome need not be present in a database to be 'interrogatable'.
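For example (a minimal sketch with placeholder file names, not a workflow prescribed in this thread), gene prediction on a newly assembled genome needs no reference database at all:

```python
# Minimal sketch: ab initio gene prediction on a novel genome/bin with
# Prodigal, which requires no reference database. File names are placeholders.
import subprocess

subprocess.run(
    [
        "prodigal",
        "-i", "novel_bin.fa",    # assembled contigs for the new organism
        "-a", "novel_bin.faa",   # predicted protein sequences
        "-d", "novel_bin.ffn",   # predicted gene (nucleotide) sequences
        "-o", "novel_bin.gff",   # coordinates of predicted genes
        "-f", "gff",
        "-p", "meta",            # metagenome mode (anonymous sequences)
    ],
    check=True,
)
# The predicted proteins can then be searched against e.g. Pfam or UniProt,
# so you can still say something functional about an organism nobody has
# seen before.
```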

Those are some things off the top of my head anyway; other people might have less hand-wavy ideas.


All valid points! I forgot about the extra resolution that can be gained from contigs vs. reads alone. This does raise some other thoughts.

You would generally get longer contigs from abundant species. Longer contigs lead to better resolution, so abundant species would be profiled better. In a well-characterized system such as the GI tract, these would be species in Bacteroides, Clostridium, etc. Okay great, now you've got genomes of things that have already been studied in depth. But the interesting stuff may be in the rarer species, and since they're rare, you may not have enough reads to assemble a genome or even longer contigs. Now you've wasted two days of cluster time to get genomes of things you already know about, and you're no closer to discovering anything new about the novel species.
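To put rough numbers on that abundance worry (all figures below are invented for illustration), here's a back-of-envelope calculation of the fold coverage each species would receive, which is what ultimately decides whether it can be assembled:

```python
# Back-of-envelope: expected coverage per species given its relative
# abundance. All numbers below are made up for illustration.
total_bases = 10e9   # e.g. ~10 Gbp of shotgun reads for the sample
genome_size = 5e6    # assume ~5 Mbp genomes for simplicity

for abundance in (0.20, 0.01, 0.001, 0.0001):
    # Fraction of bases from this species spread over its genome gives
    # the expected fold coverage (Lander-Waterman style estimate).
    coverage = total_bases * abundance / genome_size
    print(f"relative abundance {abundance:>7.2%}: ~{coverage:,.1f}x coverage")
```

Under these assumptions a species at 0.01% abundance ends up below 1x coverage, so no amount of assembly cleverness will recover long contigs for it.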

In the best-case scenario, you've got enough reads to assemble the genome of an interesting novel species. You then do gene prediction and annotation and write a nice paper. But since the genome has not been deposited in a database, another scientist could de novo assemble, annotate and publish the same genome, and neither of you may be aware you're talking about the same species. This issue is not unique to metagenomics, but it is less of a problem with databases that are open and frequently updated. Granted, that in itself leads to some other problems.

I guess I'm still trying to convince myself of the value of shotgun metagenomics (over something like 16S rDNA profiling) given the time, money and computing resources required. It obviously depends on the research question but I'm sure many projects in this space are simply some PI having more funding than he or she knows what to do with.


16S still has its place for sure. You can't beat it for cost. But there's no way around the abundance issue: if it's a rare bacterium, you may still miss it in a 16S study because you don't pick up enough starting material for amplification.

It's a bit of a false equivalence to compare 'reads only vs contigs vs amplicon'. 16S isn't quite the same as profiling WGS/NGS reads alone, not to mention it comes with a whole slew of its own biases. Anything where you're starting from primers (however degenerate or broadly applicable they might be) means you're only going to catch stuff that likes that 'bait'.

Longer contigs from the abundant organisms are all relative too. Yes, you'll probably get closer to a full, closed genome, but even a low-coverage set of reads can benefit from assembly, even if the assembly stats are weak. Unfortunately, though, you're right: there isn't a whole lot we can do to access this 'unsequenced majority' other than sequencing deeper and deeper, and you're always going to waste some reads on the common stuff.
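For what 'assembly stats' usually means in practice, here's a minimal sketch (with invented contig lengths) of N50, the most commonly quoted metric:

```python
# Minimal sketch: compute N50 from a list of contig lengths.
# The example lengths are invented; in practice you'd parse them
# from the assembly FASTA.
def n50(lengths):
    """Length L such that contigs of length >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

print(n50([50_000, 40_000, 30_000, 20_000, 10_000, 5_000, 5_000]))  # -> 40000
```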

The point about two scientists discovering the same genome is valid and probably happens all the time - but the answer is pretty simple: they should be depositing their data.

NGS is trendy, which certainly helps get money compared with something like 16S. Some researchers almost certainly do have more money than they know what to do with, and a lot of the sequencing that goes on is perhaps surplus to requirements. But how are the databases ever to improve if we aren't sequencing, and crucially depositing, more genomes - especially the rare ones?
