Question

How can I remove contaminants from an assembled genome?

3

4.0 years ago

eennadi ▴ 40

How do I remove bacterial sequences from an assembled plant genome that is contaminated with them? I submitted the genome to NCBI, and it was returned with a report that it contains some bacterial contaminants:

contig_300 186321 50454..51783 vector/etc
contig_393 236987 159928..201099,201296..201332,201613..201734 Bacillus subtilis
contig_406 88193 18501..32492 vector/etc
contig_578 254470 61645..63660,72386..74400,108384..118386 Bacillus subtilis
contig_809 95011 15573..15654,15813..15890,23371..24971,66901..66933 Bacillus subtilis subsp. subtilis str. BSP1
contig_82 168720 42435..42507,42846..43380,83644..89759,104130..104213,105875..119446,163916..164004,165518..165612,168641..168720 Bacillus subtilis
contig_893 59927 57968..58241,58518..59926 Bacillus subtilis

Any advice on how to remove these sequences?

assembly • 5.8k views

ADD COMMENT • link updated 23 months ago by shelkmike ★ 1.4k • written 4.0 years ago by eennadi ▴ 40

0

But why do you want to do that? If you are trying to get scaffolds, the contamination will just not match. If you want to map the contigs to a reference, they won't match either. I'm curious which uses of the contigs actually require removing contaminants.

ADD REPLY • link 4.0 years ago by juanjo75es ▴ 130

1

1) Imagine that you sequenced some plant and deposited its genome in GenBank without removing contamination. Then all the bacterial sequences that you deposited will be labeled in GenBank as plant sequences. This will greatly decrease the precision of taxonomic classification for future scientists, for example if they taxonomically classify sequences from metagenomes. Sequences deposited without removing contamination are a pure catastrophe for GenBank.

2) "If you want to map the contigs to a reference they won't match either". You're wrong. For example, mitochondria and plastids originated from bacteria and their sequences will match bacterial sequences. Also, there are bacterial sequences transferred into nuclear genomes of plants (https://www.pnas.org/content/112/18/5844), so there may also be matches between bacterial contamination and reference nuclear genomes.

3) If you perform a genome annotation without removing contaminants, there will be many genes that you'll attribute to your plant genome while in fact they do not belong to it. This may make conclusions about the gene content of your plant species incorrect.

ADD REPLY • link 4.0 years ago by shelkmike ★ 1.4k

0

I don't agree with point 2. The contigs will not align except in particular areas. If the contigs are large enough, that is unlikely to be a problem. It could be a problem only with certain mapping algorithms, and it isn't with the ones I usually use. I can understand point 1. Not so much point 3: I think that if you run into the problem in point 3 you are doing a number of things wrong, and removing contamination is just a quick fix that in some cases (not sure what proportion) will also lead to wrong conclusions, as the algorithms for removing contamination are very unlikely to be perfect. Thanks for the answer. I see now why it's usually justified.

ADD REPLY • link 4.0 years ago by juanjo75es ▴ 130

0

It is common practice to remove bacterial contigs from eukaryote genome assemblies, and GenBank won't let you upload contaminated assemblies.

ADD REPLY • link 3.8 years ago by Michael 55k

score 3 · Answer 1 · 2020-12-24

3

4.0 years ago

shelkmike ★ 1.4k

I usually do the following:
1) Align all contigs with BLASTN against the NCBI NT database.
2) Using the NCBI Taxonomy database, determine the taxonomy of the best matches.
3) Remove contigs whose best matches come from improper taxa. If you assemble a genome of a plant belonging to Embryophyta, you may want to remove all contigs that have matches in NCBI NT but whose best matches are not from Embryophyta.

Keep in mind that you should also retain contigs that have no matches in NCBI NT. They likely belong to non-coding regions of your plant, since non-coding regions evolve rapidly.
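The three steps above, including the rule of retaining contigs with no match, could be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: the BLASTN hits are in tabular form with the twelve standard columns plus the subject taxid as a thirteenth column, `parent` is a child-to-parent taxid map (for example built from the NCBI Taxonomy dump), and the function names are made up for this example.

```python
import csv

def best_hit_taxids(blast_lines):
    """Return {contig: taxid of its highest-bitscore hit}.

    Assumes tab-separated rows: 12 standard BLAST columns + subject taxid.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        contig, bitscore = row[0], float(row[11])
        taxid = int(row[12].split(";")[0])  # take the first taxid if several are listed
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, taxid)
    return {contig: taxid for contig, (_, taxid) in best.items()}

def in_clade(taxid, clade_root, parent):
    """Walk up the taxonomy; True if clade_root is taxid or one of its ancestors."""
    seen = set()
    while taxid not in seen:
        if taxid == clade_root:
            return True
        seen.add(taxid)
        taxid = parent.get(taxid, taxid)  # unknown taxids and the root point to themselves
    return False

def classify_contigs(all_contigs, blast_lines, clade_root, parent):
    """Split contigs into (keep, remove) by the taxon of the best hit."""
    best = best_hit_taxids(blast_lines)
    keep, remove = [], []
    for contig in all_contigs:
        # Contigs with no hit at all are retained, as recommended above.
        if contig not in best or in_clade(best[contig], clade_root, parent):
            keep.append(contig)
        else:
            remove.append(contig)
    return keep, remove
```

For a plant in Embryophyta, `clade_root` would be the NCBI taxid of Embryophyta (3193), and `all_contigs` the list of contig names from the assembly FASTA.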

ADD COMMENT • link 4.0 years ago by shelkmike ★ 1.4k

0

A little update. Now, instead of looking at the single best match, I look at the top 5 matches. If most of them belong to a wrong taxon, I consider the contig contamination. The method with a single best match is slightly less accurate because some sequences in NCBI nt are taxonomically misannotated. For example, if someone assembled a plant genome and did not remove bacterial contamination, those bacterial contigs will be annotated as plant sequences in NCBI nt.
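A rough sketch of this majority rule is below. The details are assumptions made for illustration: hits arrive as `(contig, bitscore, taxid)` tuples, "most" is read as strictly more than half of the top hits, and `parent` is a child-to-parent taxid map; the function names are hypothetical.

```python
from collections import defaultdict

def in_clade(taxid, clade_root, parent):
    """True if clade_root is taxid itself or one of its ancestors."""
    seen = set()
    while taxid not in seen:
        if taxid == clade_root:
            return True
        seen.add(taxid)
        taxid = parent.get(taxid, taxid)
    return False

def flag_contaminants(hits, clade_root, parent, top_n=5):
    """Return the set of contigs whose top hits mostly fall outside the clade.

    hits: iterable of (contig, bitscore, taxid) tuples.
    """
    per_contig = defaultdict(list)
    for contig, bitscore, taxid in hits:
        per_contig[contig].append((bitscore, taxid))

    flagged = set()
    for contig, scored in per_contig.items():
        # Keep the top_n hits by bitscore (fewer if the contig has fewer hits).
        top = sorted(scored, key=lambda hit: hit[0], reverse=True)[:top_n]
        wrong = sum(not in_clade(taxid, clade_root, parent) for _, taxid in top)
        if wrong * 2 > len(top):  # strict majority from the wrong taxon
            flagged.add(contig)
    return flagged
```

With this rule, a contig whose single best hit is a misannotated "plant" entry is still flagged when the other four top hits are bacterial.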

ADD REPLY • link 23 months ago by shelkmike ★ 1.4k

score 1 · Answer 2 · 2020-12-24

There are many different ways of doing this, and all of them attempt to classify the sequences before you submit them. You already have a good and straightforward suggestion, but beware that using that approach will take some time.

Yet another way is to do dimensionality reduction on your sequences, which tries to find a low-dimensional structure in your data. This is done by first creating high-dimensional data from your sequences. A straightforward count of tetranucleotide frequencies for your sequences will give you either 256 or 136 columns of data (depending on whether a tetranucleotide and its reverse complement are counted separately or together). Those can be embedded in 2D space, which for distantly related DNA sequences like in your case should give a clear separation between different species.
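As a minimal sketch of the feature step only, the following computes tetranucleotide frequencies per sequence with the standard library. Merging each tetranucleotide with its reverse complement gives 136 columns, and counting them separately gives 256. The function and constant names are invented for this example, and windows containing ambiguous bases such as N are simply skipped, which is a choice made here and not something the answer specifies.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    """Reverse complement of a DNA string (non-ACGT characters are left unchanged)."""
    return kmer.translate(COMPLEMENT)[::-1]

# All 4^4 = 256 tetranucleotides, in a fixed order.
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
# One representative per {k-mer, reverse complement} pair: 136 columns.
CANONICAL_TETRAMERS = sorted({min(k, revcomp(k)) for k in ALL_TETRAMERS})

def tetra_freqs(seq, canonical=True):
    """Return the tetranucleotide frequency vector of one sequence."""
    seq = seq.upper()
    columns = CANONICAL_TETRAMERS if canonical else ALL_TETRAMERS
    counts = dict.fromkeys(columns, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if canonical:
            kmer = min(kmer, revcomp(kmer))
        if kmer in counts:  # skips windows with N or other ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in columns]
```

Stacking one such vector per contig (or per fixed-size window of a long contig) gives the matrix that an embedding method such as t-SNE or UMAP would then reduce to 2D.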

Here are some links that explain how this works in general:

To demonstrate how this works, I have mixed DNA from 3 random bacterial genomes and human chromosome 22. Hopefully it will be clear from the images whether a given piece of DNA is human or not, even though there are a couple of clusters that merit further investigation.

Here is how t-SNE separates the sequences:

[Image: t-SNE embedding of the mixed sequences]

And here is the same data embedded by UMAP:

[Image: UMAP embedding of the same data]

score 0 · Answer 3 · 2022-09-22

Although Mensur Dlakic didn't provide all the steps necessary to do this clustering with sequence data, PhylOligo provides the means in one package.

PhylOligo: a package to identify contaminant or untargeted organism sequences in genome assemblies, Ludovic Mallet, Tristan Bitard-Feildel, Franck Cerutti, Hélène Chiapello, Bioinformatics, Volume 33, Issue 20, 15 October 2017, Pages 3283–3285, https://doi.org/10.1093/bioinformatics/btx396

I've used it several times and it works well, except that sequences at the edges of clusters are sometimes placed in the unclustered bin.