Hi
Assembly rookie here. I have been BLASTing some of the contigs from a de novo assembly I did, and the results are coming back with very poor query coverage. Since this is an environmental isolate, I guess it's possible that there aren't many whole-genome records for it. I tested contigs of varying sizes (the largest is 338 075 nt, with a total read count of 35 131 and coverage of 26; the smaller ones are in the 1500 nt range, with a total read count of 1647). What I want to know is: how likely is it that my trouble with identification is due to a) contamination in the DNA, or b) my assembly being so shitty that BLAST can't make a match?
Thanks in advance
Have you taken a few of the original reads and BLASTed them to make sure they show reasonable/expected hits?
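If it helps, here is a minimal sketch of one way to pull a random handful of reads out of a FASTQ file and turn them into FASTA for a quick BLAST check. It assumes plain four-line FASTQ records (no wrapped sequences); the function name `sample_fastq_to_fasta` and the file name `reads.fastq` are made up for illustration.

```python
import random


def sample_fastq_to_fasta(fastq_lines, n_reads=20, seed=0):
    """Reservoir-sample n_reads records from FASTQ lines and return FASTA text.

    Assumes each record is exactly four lines (header, sequence, '+', quality).
    """
    rng = random.Random(seed)
    reservoir = []  # holds (read_name, sequence) tuples
    record = []     # lines of the record currently being read
    seen = 0        # number of complete records seen so far

    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        header, seq = record[0], record[1]
        record = []
        seen += 1
        # Drop the leading '@' and anything after the first whitespace.
        entry = (header[1:].split()[0], seq)
        if len(reservoir) < n_reads:
            reservoir.append(entry)
        else:
            # Replace an existing entry with probability n_reads / seen.
            j = rng.randrange(seen)
            if j < n_reads:
                reservoir[j] = entry

    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)


# Hypothetical usage:
# with open("reads.fastq") as fh:
#     print(sample_fastq_to_fasta(fh, n_reads=20))
```

Reservoir sampling is used so the whole file never has to sit in memory, and sampling at random avoids only looking at the first reads in the file, which are not always representative.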
When you have nice, long contigs that don't BLAST to anything, it's likely that it's a novel organism (or at least one that is distant from anything in your BLAST database). Metagenomes usually have low coverage, and assemblers don't tend to output random junk as long contigs. Contamination is more likely to give you good BLAST hits than the metagenomic target, since the same contaminants tend to be seen in lots of labs. Thus, contamination is not a very likely explanation for this outcome.
Oops, I misread and thought this was a metagenome rather than an isolate. If the species you expect is in your BLAST database and your contigs still don't match it, then this is not the species you expect.
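To put a number on how poorly each contig matches, a rough sketch like the one below could help. It assumes you ran BLAST+ with tabular output in the custom column order `-outfmt "6 qseqid qlen qstart qend"`, and it reports the fraction of each contig covered by at least one hit, merging overlapping hits so they are not counted twice. The function name `query_coverage` is just for illustration.

```python
from collections import defaultdict


def query_coverage(tabular_lines):
    """Return {contig_id: fraction of contig covered by at least one hit}.

    Assumes tab-separated columns in the order: qseqid, qlen, qstart, qend.
    Contigs with no hits at all do not appear in BLAST output, so they will
    be missing from the result rather than reported as 0.
    """
    spans = defaultdict(list)
    lengths = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, qlen, qstart, qend = line.split("\t")[:4]
        start, end = sorted((int(qstart), int(qend)))
        spans[qseqid].append((start, end))
        lengths[qseqid] = int(qlen)

    coverage = {}
    for qseqid, intervals in spans.items():
        intervals.sort()
        covered = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent hit: extend the current block.
                cur_end = max(cur_end, end)
            else:
                # Gap between hits: close the current block (1-based, inclusive).
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        coverage[qseqid] = covered / lengths[qseqid]
    return coverage
```

Note that this pools hits to all subjects together, so a contig can look well covered even when the hits come from several different organisms. Filtering to a single subject before running it would give per-subject coverage instead.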
It depends on how you are defining contamination. If you think you are working with species A, but your assembly gives you species B, that could be considered contamination even if it's a great assembly. It's not clear if that's the case here.