We have assembled the first draft genome of an organism with a 1.2 Gb genome using PacBio, and obtained over 4,000 long contigs.
Now we want to assess whether this draft genome is contaminated with fungal sequences or sequences of any other origin.
Most metagenomic analyzers work with short reads or with metagenomic contigs, which are not as long as the contigs you get when assembling a genome.
Are you aware of any tools that allow such an analysis?
There is a subtle consideration in my question. Can you use these tools with very long contigs built from already assembled PacBio or ONT reads? I can use the reads, but the amount of data is huge, so I am considering the possibility of using the assembled FASTA files, whose sequences are vastly longer than the reads themselves.
Ah, I glossed over that aspect of the question. In that case, I would suggest kraken2 might be the best fit, given it's a k-mer-based approach, which in theory should increase confidence of calls when read length is increased (even with high error rates, you should get a few exact matches between the read and correct mapping locations). Though, if you've already polished the assembly with Illumina reads, then the error rate should be reduced anyway. However, I am unsure what you'll be able to do with the report and what the next steps would be if you did find contamination.
Whilst I like the concept of MetaMaps more, it seems to take into account the frequency distribution of reads to inform posteriors about the composition and location of hits, so I think a draft genome will violate some of the assumptions here. But I've only skimmed the paper, so I may have misinterpreted this.
You certainly focused my interest in using MetaMaps. I am reading the paper now.
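On the "what to do with the output" point, here is a rough sketch of one possible next step, not an established workflow. It assumes kraken2 was run on the contig FASTA and that you kept its standard per-sequence output (tab-separated: C/U status, sequence ID, taxid, length, and a space-separated list of `taxid:count` k-mer hits), with numeric taxids rather than `--use-names`. The function names, the 20% threshold and the taxids used in any example are illustrative assumptions. Note that `expected_taxids` has to contain every taxid on your organism's lineage up to the root, because k-mers are often assigned to higher ranks.

```python
from collections import Counter


def parse_kraken2_line(line):
    """Parse one line of Kraken2's standard per-sequence output."""
    status, seq_id, taxid, length, kmer_hits = line.rstrip("\n").split("\t", 4)
    hits = Counter()
    for token in kmer_hits.split():
        if token == "|:|":  # separator used only for paired-end input
            continue
        key, count = token.rsplit(":", 1)
        hits[key] += int(count)
    return {
        "classified": status == "C",
        "seq_id": seq_id,
        "taxid": taxid,
        "length": length,
        "hits": hits,
    }


def flag_contigs(lines, expected_taxids, min_foreign_fraction=0.2):
    """Return (contig, fraction) pairs where many k-mers hit unexpected taxa.

    The fraction is computed over informative k-mers only: unclassified
    k-mers ("0") and ambiguous-base k-mers ("A") are ignored.
    """
    flagged = []
    for line in lines:
        if not line.strip():
            continue
        rec = parse_kraken2_line(line)
        informative = {t: n for t, n in rec["hits"].items() if t not in ("0", "A")}
        total = sum(informative.values())
        if total == 0:
            continue  # nothing in the database matched this contig
        foreign = sum(n for t, n in informative.items() if t not in expected_taxids)
        fraction = foreign / total
        if fraction >= min_foreign_fraction:
            flagged.append((rec["seq_id"], fraction))
    return flagged
```

Working from the k-mer hit list rather than the single per-contig call is a deliberate choice here: a long contig gets only one taxid, so a chimeric or partly contaminated contig is easier to spot from the proportion of foreign k-mers. Flagged contigs would still need manual confirmation.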
Do you have a related genome available that you could first use to see if you can pare down the list of potentially "contaminated" contigs? Aligning with minimap2 would likely be one option. More detailed analysis can then be done with a local aligner like BLAST to identify HSPs.
But at some point you will want to examine the original reads by aligning them back to the assembly you have.
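As a minimal sketch of the "pare down the list" idea, assuming the contigs were aligned to the related genome and the alignments are in PAF (minimap2's default output format): compute, per contig, the fraction of its length covered by any alignment, and shortlist the contigs that are barely covered. The function names and the 10% cutoff are my own assumptions, and contig lengths have to be supplied separately because contigs with no alignment at all never appear in the PAF file.

```python
def covered_fraction(paf_lines, contig_lengths, min_mapq=0):
    """Fraction of each contig (PAF query) covered by at least one alignment."""
    intervals = {name: [] for name in contig_lengths}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qstart, qend, mapq = fields[0], int(fields[2]), int(fields[3]), int(fields[11])
        if qname in intervals and mapq >= min_mapq:
            intervals[qname].append((qstart, qend))

    fractions = {}
    for name, spans in intervals.items():
        covered = 0
        cur_start = cur_end = None
        # Merge overlapping query intervals so shared bases are counted once.
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            covered += cur_end - cur_start
        # PAF coordinates are 0-based with an open end, so end - start is the length.
        fractions[name] = covered / contig_lengths[name]
    return fractions


def shortlist(fractions, max_fraction=0.1):
    """Contigs with little or no similarity to the related genome."""
    return sorted(name for name, frac in fractions.items() if frac <= max_fraction)
```

A poorly covered contig is not necessarily a contaminant (it may be lineage-specific or repetitive sequence), so the shortlist is only a set of candidates for the BLAST step and for the read-level check.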
I am aware of minimap2 and its recent evolution (a very active project indeed). But I would certainly rather use a truly general metagenomic analysis, since I have no idea in advance what the contaminant could be.
After looking for information, I am convinced that most metagenomic tools are mainly designed for short reads. Very recently, a new set of tools has started to appear for long reads such as those from PacBio and ONT.
But I am certainly interested in knowing whether anything exists beyond these two alternatives. I think it would be interesting to assess whether a whole assembled genome is contaminated or not. I know that this assessment should be done before assembly, and that BBMap or other filtering tools can be used to clean and separate your reads. In addition, I have the feeling that analyzing an entire assembled genome will require fewer computing resources, since you get rid of repetitive and redundant reads (i.e. you end up with 1.5 Gb of data after assembling 40, 50 or more Gb of reads).
But after mapping some Illumina reads to a presumably mature published genome, I found convincing evidence that it was contaminated not only with adapter sequences that were poorly filtered before assembly, but also with what is presumably a biotrophic fungal population.
Hence my interest in knowing whether you can assess the metagenomic population of already assembled genomes: in many cases either you don't have access to the original reads, or you can save on computing resources.
That is going to be tricky at best as you probably realize :-) It sounds like you actually want to check a published genome and not the one you assembled.
I was referring to somebody else's genome. I have access to my own reads, of course, lol. To be clear: I was working with the Olea genome, which I did not assemble myself and for which I do not have access to the original reads. I found evidence that it was contaminated.
I was referring to the case of the other genome as well. Metagenomic assemblies are an "entity" one generates (which may not necessarily be a reflection of biology). We may never know the actual truth, since we have no idea of what was there to begin with.