When non-target sequence is detected inside the scaffolds of a plant genome assembly (not at the start or end of a scaffold), either by the NCBI submission quality check or by a local contaminant screen, how do you decontaminate the assembly?
In my case, most contaminants are ~38 bases long and are fragments of the Staphylococcus aureus YfhO family protein (NGB00725.1) and M42 family metallopeptidase (NGB00843.1).
Here is a frequency plot of the contaminant sequences.
I would appreciate your thoughts on whether to split the scaffolds at the contaminant regions or to exclude them. Would the approach to decontaminating the assembly depend on the sequencing platform that generated the data and on the genome assembly method used?
Thanks,
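For reference, the "split" option from the question can be sketched as follows. This is only a minimal illustration, assuming the contaminant spans are already available as 0-based, half-open (start, end) coordinates on a single scaffold; the function name split_at_intervals and the min_len parameter are hypothetical and do not come from any tool mentioned in this thread.

```python
def split_at_intervals(seq, intervals, min_len=1):
    """Cut seq at each 0-based, half-open (start, end) interval and drop the cut-out part.

    Returns the remaining pieces in order. Overlapping intervals are handled
    by always moving the cursor to the furthest end seen so far.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(intervals):
        if start > cursor:
            pieces.append(seq[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(seq):
        pieces.append(seq[cursor:])
    # min_len is an assumed knob: very short leftover pieces may not be worth keeping
    return [piece for piece in pieces if len(piece) >= min_len]


# Toy scaffold with a made-up contaminant span at positions 4..8 (half-open)
print(split_at_intervals("AAAACCCCGGGG", [(4, 8)]))
```

Splitting breaks scaffold contiguity, so each resulting piece would need its own identifier before submission.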
This is really weird. I've seen some bacterial 16S in plant genomes, but this length distribution is odd. I would go back to the raw reads and search for these sequences there to see where they came from: whether entire reads are contaminated and contain bacterial genes, or just this region, etc.
I think that would only help to trace those non-target sequences back to the raw reads. How would it help in deciding between the clean-up options? The assembly needs to be decontaminated for submission.
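A rough sketch of that raw-read check, assuming the reads have already been parsed into (read_id, sequence) pairs and that exact string matching is good enough for a first look (a real screen would more likely use an aligner such as BLAST). The names scan_reads and reverse_complement and the toy sequences are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def scan_reads(reads, contaminant):
    """List (read_id, strand, position) for reads containing the contaminant exactly.

    Both strands are checked; only the first hit per read is reported.
    """
    fwd = contaminant.upper()
    rev = reverse_complement(fwd)
    hits = []
    for read_id, seq in reads:
        seq = seq.upper()
        for strand, query in (("+", fwd), ("-", rev)):
            pos = seq.find(query)
            if pos != -1:
                hits.append((read_id, strand, pos))
                break
    return hits


# Toy data: a made-up 7-base "contaminant" and three made-up reads
reads = [("r1", "CCGATTACACC"), ("r2", "TTTGTAATCTT"), ("r3", "ACGTACGT")]
print(scan_reads(reads, "GATTACA"))
```

Comparing the hit position and the contaminant length with the read length indicates whether the bacterial sequence spans the whole read or only a short internal stretch.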
What Asaf means is to
Yes and no. You might find out that the contamination is greater than the algorithm predicted and then just masking those 38-mers won't be sufficient.
Thanks, colindaven. I hadn't looked at it that way; one option would be to remove the contaminants from the raw reads and then reassemble. But I will not be able to do that for various reasons, so I will instead try to work out a plausible way to deal with the assembly at hand.
You could rather brutally mask the contaminants with Ns ... :
https://github.com/colindaven/blacklister
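To illustrate the general idea of hard masking (this is not the code from the linked repository, just a minimal sketch), the snippet below overwrites given spans with Ns while keeping the scaffold length unchanged. It assumes 0-based, half-open (start, end) coordinates; if your contamination report uses 1-based inclusive coordinates, subtract 1 from each start first. The name mask_intervals is hypothetical.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each 0-based, half-open (start, end) span of seq with mask_char.

    The sequence length is preserved, so downstream coordinates stay valid.
    """
    chars = list(seq)
    for start, end in intervals:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


# Toy scaffold with a made-up contaminant span at positions 4..8 (half-open)
print(mask_intervals("ACGTACGTACGTACGT", [(4, 8)]))
```

Unlike splitting, masking keeps the scaffold in one piece, at the cost of introducing a short run of Ns in place of the contaminant.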
It does mask out the contaminants, but what is the rationale for replacing the contaminant sequences with gaps of Ns?
https://www.drive5.com/usearch/manual/masking.html