I've been getting partial matches when comparing my assembly to a BLAST database. I was told that partial matches of contigs to the BLAST database are due to chimeras (unrelated sequences that are wrongly merged during the assembly process). For instance, a little over half of a contig of about 880 kbp matched a certain bacterium, which must be a result of sample contamination. Assuming these are chimeras, I thought a sensible next step would be to remove the contaminating sequences from the chimeric contigs, which will hopefully keep only the DNA of interest. Then I'll be able to align these sequences to a reference genome or to genomes of closely related species, to find genetic differences between them, which is my research question.
I'd love to get your opinions on whether that's a good idea.
What species are you working on? Is there any chance of HGT?
If it is just one contig, is there a problem with removing it?
Since it has already been assembled, you should align the reads back to the assembly and look at the junction you believe created the chimeric structure, to see whether the reads support it.
Lastly, you can try another assembler to see if it also generates this chimeric contig.
It's a Myxozoa. Horizontal gene transfer? That would be between bacteria and Myxozoa. Sample contamination is a very likely scenario, though I don't know how to confirm or refute HGT. Do you?
By "align the reads assembled", do you mean aligning the original raw reads to the assembled scaffolds?
Yup, same as below: align the raw reads against the assembly and check the border between the DNA originating from the myxozoan and the DNA from the bacterium. See whether the reads confirm that the structure is real or not.
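To make the junction check a bit more concrete, here is a minimal sketch, assuming you have already extracted the reference start/end coordinates of the reads mapped to that contig (for example from a BAM file). The function name `spanning_reads` and the `flank` value are my own illustrative choices, not part of any tool.

```python
def spanning_reads(alignments, junction, flank=20):
    """Count reads that cross `junction` with at least `flank` bases on each side.

    alignments: iterable of (start, end) reference intervals on one contig,
                0-based and half-open (assumed input format).
    junction:   0-based position of the suspected myxozoan/bacterial border.
    """
    return sum(
        1
        for start, end in alignments
        if start <= junction - flank and end >= junction + flank
    )


# Hypothetical intervals: only the (50, 150) read truly spans position 100.
reads = [(0, 100), (50, 150), (90, 190), (95, 105)]
print(spanning_reads(reads, junction=100))
```

If few or no reads span the border while the neighbouring regions are well covered, that would hint at a misassembly; many spanning reads would suggest the join is real.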
I had assumed long reads but perhaps that is not the case?
To start off immediately with the worst news: this is one of the most difficult things to resolve in assembly.
Rerunning the whole assembly with, for instance, more stringent settings is perhaps the best option, but it is usually not feasible.
There are probably also tools around that can somewhat fix this, but it remains difficult. One thing you can try yourself is to map the reads back to your assembled contigs and inspect the mapping coverage. The thing to look for is 'drops' in the coverage, which indicate that something might be going on there, so you might be able to pinpoint where the chimeric join is located.
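A rough sketch of the coverage-drop idea, assuming you already have per-base depths for one contig as a list of integers (for example parsed from the third column of `samtools depth -a` output). The window size and the 20% cutoff are arbitrary assumptions to tune for your data, and `coverage_drops` is just an illustrative name.

```python
from statistics import median


def coverage_drops(depths, window=100, fraction=0.2):
    """Return (start, end) windows whose mean depth is below
    `fraction` times the median depth of the whole contig."""
    if not depths:
        return []
    cutoff = fraction * median(depths)
    drops = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        if sum(chunk) / len(chunk) < cutoff:
            drops.append((start, start + len(chunk)))
    return drops


# Toy contig: ~50x coverage with a 100 bp region at 2x.
toy = [50] * 300 + [2] * 100 + [50] * 300
print(coverage_drops(toy))
```

A flagged window that lines up with the point where the BLAST hits switch from eukaryotic to bacterial is a good candidate for the chimeric join.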
Simply removing all contigs that have some contamination match is also an option, but you will chuck out many true sequences as well. Moreover, to make it all even more difficult: the fact that part of a contig matches bacteria does not mean that part isn't genuine sequence from your organism (there are eukaryotic sequences that match bacteria as well).
If you want to remove contaminant contigs from your assembly, it is worth taking some other information into account as well. For instance, you can check the %GC of your contigs (bacterial and eukaryotic contigs will often have clearly different %GC content) in combination with your BLAST matches.
If a whole contig has nothing but bacterial matches and a %GC that deviates strongly from the mean of your whole assembly, there is a high chance it is indeed contamination.
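As a minimal sketch of the %GC check, assuming the contigs are already loaded into a dict of name to sequence: the code below flags contigs whose GC fraction is far from the assembly mean. Using an unweighted per-contig mean and a z-score cutoff of 2 is my own simplification, and the function names are hypothetical.

```python
from statistics import mean, stdev


def gc_fraction(seq):
    """GC fraction of a sequence, ignoring Ns and other ambiguity codes."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def flag_gc_outliers(contigs, z_cutoff=2.0):
    """Return {name: gc} for contigs whose GC deviates from the
    per-contig mean by more than `z_cutoff` standard deviations."""
    gc = {name: gc_fraction(seq) for name, seq in contigs.items()}
    if len(gc) < 2:
        return {}
    mu, sd = mean(gc.values()), stdev(gc.values())
    if sd == 0:
        return {}
    return {name: g for name, g in gc.items() if abs(g - mu) / sd > z_cutoff}


# Toy assembly: nine AT-rich contigs and one GC-rich contig.
toy = {f"contig{i}": "ATATATGCAT" for i in range(9)}
toy["contig9"] = "GCGCGCGCAT"
print(sorted(flag_gc_outliers(toy)))
```

To combine this with BLAST, you could intersect the flagged names with the set of contigs whose hits are exclusively bacterial, and treat only that overlap as likely contamination.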
Long story short: it's a difficult issue to resolve.
Are we talking contigs or scaffolds here, btw? (Contigs are even worse in this setting than scaffolds.)
Yes, by 'mapping back' I do indeed mean that (map the raw reads back to the assembly).
If they are scaffolds, then you might be able to split them back up into contigs. Scaffolds are (usually) contigs joined together with stretches of Ns. So if one part seems to be contamination and the other not, AND the boundary coincides with a stretch of Ns, you can split the scaffold at that stretch of Ns, throw away only the part matching bacteria, and retain the other part in your assembly.
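Splitting at the N stretches could look roughly like this. It is only a sketch: `split_scaffold` is an illustrative name, and the minimum gap length of 10 Ns is an assumption, since assemblers differ in how they pad scaffold gaps.

```python
import re


def split_scaffold(seq, min_gap=10):
    """Split a scaffold at runs of at least `min_gap` Ns.

    Returns a list of (start, end, subsequence) with 0-based, half-open
    coordinates on the original scaffold, so each piece can be matched
    back to the BLAST hit positions.
    """
    pieces = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, seq):
        if gap.start() > pos:
            pieces.append((pos, gap.start(), seq[pos:gap.start()]))
        pos = gap.end()
    if pos < len(seq):
        pieces.append((pos, len(seq), seq[pos:]))
    return pieces


# Toy scaffold: two contigs joined by ten Ns.
print(split_scaffold("ACGT" + "N" * 10 + "GGCC"))
```

Keeping the original coordinates makes it easy to decide which pieces overlap the bacterial hits and drop only those.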
Good luck
Thanks.