Question

How to assemble viral genomes when my data contains host DNA as well

1

4.8 years ago

GBC_Zonatos ▴ 10

I'm currently trying to assemble a viral genome, but I'm unsure how to proceed, as my samples contain both viral DNA and bacterial DNA (from its host).

I'm using a pipeline that we usually use for bacterial assemblies without problems: A5 and SPAdes assemble the contigs, and both assemblies are then passed to GMCloser to try to close any gaps. We get very good results for bacteria, and we seem to have achieved good results on the viral DNA as well, obtaining 42 scaffolds, two of them with coverage over 2000. In a BLAST search against NCBI, one of these two scaffolds matched our host bacterium, while the other matched a viral genome similar to what we expected. This viral scaffold had the highest average coverage overall (cov > 2000) and a length of 40 kbp, and it aligned to a known virus that infects the host we found in our samples. It seems we managed to recover most of the genome, as the complete genome of the virus it aligned to is also around 40 kbp long.

I'm unsure how to check that scaffold for contamination, though. It appears to be the right length, and after BLASTing it against NCBI I found a few similar viruses, retrieved their complete genomes, and compared them by ANI (using MUMmer alignments), which showed that 35,350 bp (87.79% of my genome) aligned to a reference viral genome. Using Genome Detective (https://www.genomedetective.com/app/typingtool/virus/) I found that it aligned with 94% coverage/concordance to a specific viral genome, which seems to confirm a good alignment.

Are there any other steps I can use to search this scaffold for host DNA, in case some DNA was misassembled? I've run all scaffolds through the Genome Detective tool mentioned above, and it found viral DNA on only one other scaffold, with just 3% alignment. That leads me to think that this other scaffold actually comes from the host, and that the 3% alignment comes from sequences shared between the virus and the host. I'm wondering whether my 'viral scaffold' might also contain such shared sequences and, if so, whether the assembly could have produced chimeras that mix host DNA into it.

I'm looking for input from anyone more experienced with viral assemblies.

Assembly assembly virus genome contamination • 2.0k views

ADD COMMENT • link updated 4.8 years ago by Antonio R. Franco ★ 5.2k • written 4.8 years ago by GBC_Zonatos ▴ 10

score 3 · Answer 1 · 2020-01-24

3

4.8 years ago

colin.kern ★ 1.1k

If the host bacterium has a known genome assembly, you can use any short-read aligner, e.g. BWA or Bowtie, to align your raw reads to the bacterial genome. Then take the unaligned reads and run your assembly pipeline on just those.
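
As a rough illustration of the "take the unaligned reads" step, here is a minimal Python sketch that reads aligner output in SAM format and returns the names of reads (or read pairs) with no alignment to the host. The function name and the toy records are made up for this example. The only thing taken from the SAM format itself is that bit 0x4 of the FLAG column marks an unmapped read. In practice a tool such as samtools (e.g. `samtools view -f 4`) would normally do this filtering; the sketch only shows the logic.

```python
def host_free_read_names(sam_lines):
    """Return names of reads (or read pairs) with no alignment to the host.

    A name is kept only if every record carrying it has the 'unmapped'
    bit (0x4) set, so a pair is dropped as soon as one mate hits the host.
    """
    unmapped = set()
    mapped = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            unmapped.add(name)
        else:
            mapped.add(name)
    return unmapped - mapped


# Toy SAM records (hypothetical read names and host contig name).
demo_sam = [
    "@SQ\tSN:host_chr\tLN:5000000",
    "r1\t0\thost_chr\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCAT\tIIIIIIII",
    "r3\t73\thost_chr\t200\t60\t8M\t=\t200\t0\tGGGTTTAA\tIIIIIIII",
    "r3\t133\thost_chr\t200\t0\t*\t=\t200\t0\tCCCAAATT\tIIIIIIII",
]

print(sorted(host_free_read_names(demo_sam)))
```

Dropping a whole pair when only one mate maps to the host is the conservative choice here; keeping the unmapped mate as a single read would be an alternative if you are worried about losing viral reads near shared sequences.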

score 3 · Answer 2 · 2020-01-24

From what you describe, it seems like you have a clean co-assembly of a virus and its host. You already have a suggestion to remove host-mapping reads, which I think is worth trying.

A couple of additional suggestions:

1) Check the completeness of your viral and host contig bins using CheckM. It will estimate the host genome completeness, which is probably good to know, and if everything is correct it should assign your viral contig to the root category with 0% completeness. That tells you indirectly that the viral contig is not cellular in origin.

2) Do a tetra-nucleotide (or penta-nucleotide) frequency embedding of all your contigs using PCA or t-SNE. You have lots of choices here: I like MetaBAT and CONCOCT, and VizBin is pretty user-friendly. Any of them should work, as viral contigs are normally clearly separable from bacterial contigs.
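
To make the tetra-nucleotide idea concrete, here is a minimal, standard-library-only Python sketch that turns each contig into a vector of 4-mer frequencies and compares vectors by Euclidean distance. It is not what MetaBAT, CONCOCT or VizBin do internally. It is a simplification that, for instance, does not merge reverse-complement k-mers and skips the PCA/t-SNE step, which would take these vectors as input. The function names and toy sequences are invented for the example.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every A/C/G/T k-mer in seq.

    The vector has 4**k entries in a fixed (lexicographic) order.
    Windows containing other characters (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "contigs": two with the same composition, one clearly different.
contig_a = "ACGT" * 50
contig_b = "ACGT" * 40
contig_c = "GGGCCC" * 30

freq_a = kmer_frequencies(contig_a)
freq_b = kmer_frequencies(contig_b)
freq_c = kmer_frequencies(contig_c)

print("a vs b:", euclidean(freq_a, freq_b))
print("a vs c:", euclidean(freq_a, freq_c))
```

On real data you would compute one vector per scaffold and look at whether the putative viral scaffold sits apart from the host scaffolds. A sliding window along the viral scaffold is a natural extension if you want to look for a host-like stretch inside it.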

score 2 · Answer 3 · 2020-01-24

2

4.8 years ago

onestop_data ▴ 330

I agree with @colin.kern. If the host is not known, you can try a tool such as MetaBAT, which uses unsupervised methods to bin the contigs of a mixed community by organism (in your case, the virus and the host).

score 1 · Answer 4 · 2020-01-25

Another possibility:

If you know the bacterial genome, you can get rid of most of its sequences by filtering the reads with BBSplit.

Then you need to assemble again with the filtered reads. Neither mapping with Bowtie nor filtering with BBSplit can guarantee that you remove all bacterial sequences, since some of the bacterial reads in your sample may not be represented in the reference genome you use.