Question

Forum: Confirmation of metagenomic data

9.7 years ago

Whetting ★ 1.6k

Hi,

As a community, we are unsure how to deal with new viruses discovered through metagenomic tools. Currently, the rules state that, to be considered a novel (papillomavirus) isolate, the entire genome has to be cloned and sequenced. However, more and more people are choosing not to follow this rule, especially with metagenomic data. We would like to develop a new set of rules that would allow "metagenomic genomes" to be considered "real" [edit based on Josh's answer: by "real" I mean: is the identified genome the naturally occurring complete sequence of this virus, rather than an artifact such as an assembly-induced hybrid?]. One of the concerns is the reproducibility of the assembly method. We are considering having two independent labs perform the assembly de novo as a confirmation step.
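To make the two-lab idea concrete, here is a minimal sketch of how two independent assemblies of a circular genome could be compared. It assumes both labs report a single closed, circular contig (papillomavirus genomes are circular), so the same genome may be written starting at a different position or on the opposite strand. The function names are hypothetical, and the check is exact-match only; real assemblies with a few base differences would need an alignment-based comparison instead.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(complement)[::-1]


def same_circular_genome(assembly_a: str, assembly_b: str) -> bool:
    """True if assembly_b is a rotation of assembly_a on either strand.

    Assumption: both inputs are complete circular contigs without
    duplicated overlap at the ends. Exact matching only.
    """
    a = assembly_a.upper()
    b = assembly_b.upper()
    if len(a) != len(b):
        return False
    # Every rotation of b appears as a substring of b + b.
    doubled = b + b
    return a in doubled or reverse_complement(a) in doubled


# Toy example: lab B's contig starts at a different position.
lab_a = "AACGT"
lab_b = "CGTAA"
print(same_circular_genome(lab_a, lab_b))
```

If the two contigs disagree under this check, that does not prove a chimera; it only flags that the assemblies need a closer, alignment-based look.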

I have two questions for you guys and gals:

1. Does this sound reasonable, too strict, not strict enough?
2. Obviously, assembly is time-consuming and isn't trivial. Would you guys like to share some thoughts on preferred assemblers, pipelines, etc.?

as always,

Thanks!

metagenomics assembly dnaSeq • 2.9k views

updated 2.5 years ago by Ram 45k • written 9.7 years ago by Whetting ★ 1.6k

Ram · Answer 1 · 2015-08-13

1. Does this sound reasonable, too strict, not strict enough?

I'm a little thrown off by your question: of course genomes from mixed samples are real genomes. You're trying to establish standards for describing new species of viruses from metagenomic data. There are already a lot of groups doing this, so I think standards are worked out to a certain extent. People have been defining organisms on the basis of their DNA for four (or more) decades now. Why should more data change things? More data is really the only difference with metagenomic data, as we have been sequencing mixed samples through cloning, etc., for the last 30 years.

Here's a commentary I co-authored earlier this year on the systematics and taxonomy of environmentally derived sequence data (focused on plants as a host, but human-host-associated papillomaviruses are no different in an ecological sense). I hope it helps.

2. Obviously, assembly is time-consuming and isn't trivial. Would you guys like to share some thoughts on preferred assemblers, pipelines, etc.?

This is a quickly evolving and constantly changing field right now and I don't think the community has come up with a preferred pipeline or system. I feel like I could write a book on what to do and what not to do here, but I think you have to dive into the literature.

The main assembly program I use now for metagenomics (MEGAHIT) didn't exist a year ago, so it's hard to gauge standards at the moment. There are tons of great new tools, and twice as many poor tools out there.

For what you are working on with papillomaviruses, I would highly recommend the pipeline (though it's less than 2 years old and probably already dated) from this really good paper from Itai Sharon in Jill Banfield's lab: Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.

The bottom line (Istvan mentions this in his answer) is the fear of assembly chimeras or misassembled metagenomic genomes. How do you know that what you have assembled is actually the correct genome in your sample? Long reads will change this field, but for now, you have to be extra careful about claiming you have a new organism on the basis of a metagenome assembly. I would look at any metagenome assembly or short-read annotation with skepticism.

Ram · Answer 2 · 2015-08-13

9.7 years ago

Istvan Albert 102k

I would suggest validating via platforms such as the MinION. Currently it produces error-prone but very long reads. Assembling from those is a bit tedious (to say the least), but verifying assemblies with them is very straightforward. Nothing proves that an assembly is correct better than an (even messy) alignment over its entire length. The problems with assemblies are usually not about sequence identity but about joining unrelated fragments.

In silico assemblies, especially from metagenomic data, run the risk of producing chimeras, and they have a really hard time with genomes that share similarities or that come from diverse populations.

9.7 years ago by Istvan Albert 102k

That's an interesting point. It may be cost-prohibitive for the "confirming" lab to use something like the MinION. Likewise, requiring a certain technology may not be the best avenue either?

updated 2.5 years ago by Ram 45k • written 9.7 years ago by Whetting ★ 1.6k

A MinION device costs $1000, the flow cell can be run multiple times, and you can use it to validate different findings.

I could see this being offered commercially as well. We are just not trained to think in terms of: all I need is 10 good reads that cover my entire virus.
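As a rough illustration of the "10 good reads" idea, the check could look something like the sketch below. It assumes you have already aligned the long reads to the assembled genome and extracted one (start, end) reference interval per read, 0-based and half-open. The function names, the 95% span cutoff, and the 10-read threshold are illustrative assumptions, and the genome is treated as linear, so reads that wrap around the origin of a circular genome are not handled.

```python
from typing import Iterable, Tuple


def count_spanning_reads(
    alignments: Iterable[Tuple[int, int]],
    genome_length: int,
    min_fraction: float = 0.95,
) -> int:
    """Count reads whose alignment covers at least min_fraction of the genome.

    alignments: (start, end) reference coordinates, 0-based, half-open.
    Assumption: one contiguous alignment per read, linear coordinates.
    """
    needed = min_fraction * genome_length
    return sum(1 for start, end in alignments if end - start >= needed)


def is_supported(
    alignments: Iterable[Tuple[int, int]],
    genome_length: int,
    min_reads: int = 10,
    min_fraction: float = 0.95,
) -> bool:
    """True if enough single reads span (nearly) the whole assembly."""
    spanning = count_spanning_reads(alignments, genome_length, min_fraction)
    return spanning >= min_reads


# Toy example: an 8 kb genome, ten full-length reads and one partial read.
reads = [(0, 8000)] * 10 + [(100, 3000)]
print(count_spanning_reads(reads, 8000), is_supported(reads, 8000))
```

A chimeric assembly would typically show the opposite pattern: plenty of reads covering each half, but few or none crossing the junction between them.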

updated 2.5 years ago by Ram 45k • written 9.7 years ago by Istvan Albert 102k