How do I remove bacterial sequences from an assembled plant genome that is contaminated with them? I submitted the genome to NCBI, and it was returned with a report that it contains some bacterial contaminants:
contig_300 186321 50454..51783 vector/etc
contig_393 236987 159928..201099,201296..201332,201613..201734 Bacillus subtilis
contig_406 88193 18501..32492 vector/etc
contig_578 254470 61645..63660,72386..74400,108384..118386 Bacillus subtilis
contig_809 95011 15573..15654,15813..15890,23371..24971,66901..66933 Bacillus subtilis subsp. subtilis str. BSP1
contig_82 168720 42435..42507,42846..43380,83644..89759,104130..104213,105875..119446,163916..164004,165518..165612,168641..168720 Bacillus subtilis
contig_893 59927 57968..58241,58518..59926 Bacillus subtilis
Any advice on how to remove these sequences?
But why do you want to do that? If you are trying to build scaffolds, the contamination will simply not match. If you want to map the contigs to a reference, it won't match either. I'm curious which use of the contigs makes removing contaminants necessary.
1) Imagine that you sequenced a plant and deposited its genome in GenBank without removing contamination. All the bacterial sequences you deposited would then be labelled in GenBank as plant sequences. This would greatly reduce the precision of taxonomic classification for future scientists, for example when they classify sequences from metagenomes. Depositing sequences without removing contamination is a pure catastrophe for GenBank.
2) "If you want to map the contigs to a reference they won't match either". You're wrong. For example, mitochondria and plastids originated from bacteria and their sequences will match bacterial sequences. Also, there are bacterial sequences transferred into nuclear genomes of plants (https://www.pnas.org/content/112/18/5844), so there may also be matches between bacterial contamination and reference nuclear genomes.
3) If you perform a genome annotation without removing contaminants, there will be many genes that you'll consider belonging to your plant genome, while in fact they do not. This may make conclusions about the gene content of your plant species incorrect.
I don't agree with point 2. The contigs will not align except in particular regions, and if the contigs are large enough that is unlikely to be a problem. It could be a problem only with certain mapping algorithms, and it is not with the ones I usually use. I can understand point 1. Not so much point 3: if you run into the problem in point 3, you are doing several things wrong, and removing contamination is just a quick fix that in some cases (I'm not sure what proportion) will also lead to wrong conclusions, as the algorithms for removing contamination are very unlikely to be perfect. Thanks for the answer. I see now why it's usually justified.
It is common practice to remove bacterial contigs from eukaryote genome assemblies, and GenBank won't let you upload contaminated assemblies.
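As a rough illustration of one way to act on a report like the one in the question, here is a minimal Python sketch. It assumes the spans are 1-based and inclusive, in the `start..end,start..end` form shown above. The function names (`parse_spans`, `mask_spans`, `split_at_spans`) and the `min_len` cutoff of 200 bp are my own choices for the example, not part of any NCBI tool. Whether you mask, trim, split, or drop a contig entirely should follow the instructions that came with your report.

```python
def parse_spans(text):
    """Parse '15573..15654,15813..15890' into sorted (start, end) tuples.

    Coordinates are assumed to be 1-based and inclusive.
    """
    spans = []
    for part in text.split(","):
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return sorted(spans)


def mask_spans(seq, spans, mask_char="N"):
    """Return seq with every flagged span overwritten by mask_char."""
    chars = list(seq)
    for start, end in spans:
        # Convert 1-based inclusive coordinates to 0-based indices,
        # clamping to the sequence length.
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def split_at_spans(seq, spans, min_len=200):
    """Cut the flagged spans out and return the clean pieces.

    Pieces shorter than min_len are discarded (the cutoff is an
    assumption for this sketch; adjust it to your submission rules).
    """
    pieces = []
    pos = 0  # 0-based index of the first base not yet consumed
    for start, end in spans:
        piece = seq[pos:start - 1]
        if len(piece) >= min_len:
            pieces.append(piece)
        pos = max(pos, end)
    tail = seq[pos:]
    if len(tail) >= min_len:
        pieces.append(tail)
    return pieces


if __name__ == "__main__":
    # Toy 16 bp sequence with two made-up contaminated spans.
    toy = "AAAACCCCGGGGTTTT"
    spans = parse_spans("5..8,13..16")
    print(mask_spans(toy, spans))
    print(split_at_spans(toy, spans, min_len=1))
```

For a contig where the flagged spans sit at one end (contig_893 in the report, for example), splitting amounts to trimming. For contigs with large internal bacterial blocks, it may be worth checking whether the whole contig is bacterial before keeping any of it.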