Question

Decontaminating plant genome assembly

0

2.5 years ago

Sadik Muzemil ▴ 20

When non-target sequence is detected inside the scaffolds of a plant genome (not at the start or end of the scaffolds), either by the NCBI submission quality check or by local contaminant screening, how do you go about decontaminating the assembly?

In my case, most contaminants are ~38 bases long and are fragments of the Staphylococcus aureus YfhO family protein (NGB00725.1) and M42 family metallopeptidase (NGB00843.1).

Here is a frequency plot of the contaminant sequences.

[Figure: frequency plot of the contaminant sequences]

I would appreciate your thoughts on whether to split the scaffolds at the contaminant regions or to exclude them. Would the approach to decontaminating the assembly depend on the sequencing platform used to generate the data and on the genome assembly method?

Thanks,

decontaminate contaminants Genome assembly genome • 1.4k views

ADD COMMENT • link 2.5 years ago by Sadik Muzemil ▴ 20

2

This is really weird. I've seen some bacterial 16S in plant genomes, but this length distribution is odd. I would go back to the raw reads and search for these sequences there to see where they came from: whether entire reads are contaminated and contain bacterial genes, or just this region, etc.
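As a rough illustration of that check, here is a minimal Python sketch that reports where a contaminant fragment sits inside each read. It assumes the reads are already parsed into (name, sequence) pairs and uses exact string matching on both strands; the function names and the toy reads are hypothetical, and for real data an aligner would probably be the better tool.

```python
# Assumption: reads are already parsed into (name, sequence) pairs and the
# contaminant is matched exactly; real data may call for an aligner instead.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_contaminant(reads, contaminant):
    """Find reads containing the contaminant on either strand.

    Returns a list of (read_name, match_start, read_length) tuples, so one can
    see whether the match covers the whole read or only an internal stretch.
    """
    forward = contaminant.upper()
    targets = (forward, reverse_complement(forward))
    hits = []
    for name, seq in reads:
        upper = seq.upper()
        for target in targets:
            pos = upper.find(target)
            if pos != -1:
                hits.append((name, pos, len(seq)))
                break  # one hit per read is enough for this summary
    return hits


# Toy reads (hypothetical data, not from the thread)
reads = [
    ("read1", "TTTTACGTACGGTTTT"),  # contaminant embedded mid-read
    ("read2", "GGGGGGGGGGGGGGGG"),  # clean read
    ("read3", "CCGTACGT"),          # whole read is the reverse complement
]
for name, start, length in locate_contaminant(reads, "ACGTACGG"):
    print(name, start, length)
```

Comparing the match start and the read length shows at a glance whether the bacterial fragment makes up the whole read or is flanked by other sequence.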

ADD REPLY • link 2.5 years ago by Asaf 10k

0

I think that would only help to trace those non-target sequences back to the raw reads. How would it help in deciding between the clean-up options? The assembly needs to be decontaminated for submission.

ADD REPLY • link 2.5 years ago by Sadik Muzemil ▴ 20

2

What Asaf means is to:

align all reads to the contaminant sequences
then exclude the reads that align, e.g. with samtools fastq, bedtools bamtofastq or similar
then reassemble
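The read-exclusion step could be sketched in plain Python as below. Note that this swaps alignment for a simpler k-mer screen: a read is dropped if it shares any k-mer with a contaminant on either strand. The function names, the default k of 21, and the single-end assumption are all illustrative choices, not anything from the thread; with paired-end data both mates would have to be dropped together.

```python
from itertools import islice

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def kmers_both_strands(seq, k):
    """Return the set of k-mers in seq and in its reverse complement."""
    seq = seq.upper()
    strands = (seq, seq.translate(_COMPLEMENT)[::-1])
    # Sequences shorter than k contribute no k-mers.
    return {s[i:i + k] for s in strands for i in range(len(s) - k + 1)}


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (newlines stripped)."""
    it = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield record


def filter_reads(handle, contaminants, k=21):
    """Yield only the records that share no k-mer with any contaminant."""
    bad = set()
    for contaminant in contaminants:
        bad |= kmers_both_strands(contaminant, k)
    for record in fastq_records(handle):
        seq = record[1].upper()
        if not any(seq[i:i + k] in bad for i in range(len(seq) - k + 1)):
            yield record


# Toy input (hypothetical); with real files, pass an open file handle instead.
fastq_lines = [
    "@r1\n", "TTTTACGTACGGTTTT\n", "+\n", "IIIIIIIIIIIIIIII\n",
    "@r2\n", "GGGGGGGGGGGGGGGG\n", "+\n", "IIIIIIIIIIIIIIII\n",
]
for record in filter_reads(fastq_lines, ["ACGTACGGAT"], k=8):
    print("\n".join(record))
```

The surviving records would then go back into the assembler. An exact k-mer screen will miss reads whose contaminant portion carries sequencing errors, which is one reason an aligner is usually preferred for this step.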

ADD REPLY • link 2.5 years ago by colindaven 7.0k

1

Yes and no. You might find that the contamination is more extensive than the algorithm predicted, in which case just masking those 38-mers won't be sufficient.

ADD REPLY • link 2.5 years ago by Asaf 10k

0

Thanks, colindaven. I hadn't looked at it that way; cleaning the contaminants out of the raw reads and then reassembling could be one way to go. But I won't be able to do that, for various reasons, and will instead try to work out a plausible way to deal with the assembly at hand.

ADD REPLY • link 2.5 years ago by Sadik Muzemil ▴ 20

0

You could rather brutally mask the contaminants with Ns:

https://github.com/colindaven/blacklister
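For a sense of what masking amounts to, and how it differs from splitting the scaffold at the contaminant, here is a small Python sketch. It is not the blacklister implementation; the function names are hypothetical, and it assumes contaminant spans are given as 0-based, half-open (start, end) intervals, so a report that uses 1-based inclusive coordinates would need converting first.

```python
def merge_intervals(intervals):
    """Sort and merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_with_n(seq, intervals):
    """Overwrite each contaminant interval with Ns; scaffold length is unchanged."""
    out = list(seq)
    for start, end in merge_intervals(intervals):
        end = min(end, len(seq))  # clamp so the sequence never grows
        out[start:end] = "N" * max(0, end - start)
    return "".join(out)


def split_at(seq, intervals, min_len=1):
    """Cut the contaminant intervals out and return the remaining pieces."""
    pieces, prev = [], 0
    for start, end in merge_intervals(intervals):
        if start - prev >= min_len:
            pieces.append(seq[prev:start])
        prev = max(prev, min(end, len(seq)))
    if len(seq) - prev >= min_len:
        pieces.append(seq[prev:])
    return pieces


# Toy scaffold (hypothetical) with a contaminant at positions 4..8
scaffold = "AAAACCCCGGGGTTTT"
print(mask_with_n(scaffold, [(4, 8)]))
print(split_at(scaffold, [(4, 8)]))
```

Masking keeps the scaffold in one piece with its length and downstream coordinates intact, whereas splitting yields separate, shorter sequences and discards the link between the flanks.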

ADD REPLY • link 2.5 years ago by colindaven 7.0k

0

It does mask out the contaminants, but I don't understand the rationale for replacing contaminant sequences with runs of Ns.