I am trying to assemble the genome of a tree, and I have strong evidence that the DNA could be heavily contaminated with foreign DNA. It is not that the leaves we used were externally contaminated with other organisms. I am talking about a millennial tree that could be supporting another form of life internally.
How can I find out what the source of that DNA is? Running BlastN on billions of sequences is not an option.
My first thought was to use FastQ Screen, but I guess that's not an option with billions of sequences. Could you cluster similar sequences into groups, then run FastQ Screen or blastn on a representative sequence from each group? You could prioritise the search by working through the groups in decreasing order of size. If you have a lot of contamination, wouldn't the first few groups you try be from the other species?
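To make the idea concrete, here is a minimal sketch of "cluster, then search one representative per group, largest groups first". It is a toy: the greedy shared k-mer clustering, the function names (`greedy_cluster`, `representatives_fasta`) and the thresholds (`k=8`, `min_shared=0.5`) are all my own assumptions for illustration, not how any particular tool works. It also compares every read against every representative, so it would only be usable on a small random subsample of the reads; for real data you would want a dedicated clustering tool.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(reads, k=8, min_shared=0.5):
    """Greedy clustering: a read joins the first cluster whose representative
    contains at least `min_shared` of the read's k-mers; otherwise the read
    becomes the representative of a new cluster.

    Returns a list of (representative, members), largest cluster first.
    """
    reps = []     # (representative sequence, its k-mer set)
    members = []  # members[i] holds the reads assigned to cluster i
    for read in reads:
        read_kmers = kmers(read, k)
        for i, (_, rep_kmers) in enumerate(reps):
            if read_kmers and len(read_kmers & rep_kmers) / len(read_kmers) >= min_shared:
                members[i].append(read)
                break
        else:
            # No existing representative was similar enough.
            reps.append((read, read_kmers))
            members.append([read])
    clusters = [(rep, group) for (rep, _), group in zip(reps, members)]
    # Biggest groups first, so they are searched first.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return clusters


def representatives_fasta(clusters, top_n=None):
    """Format the representatives of the top_n largest clusters as FASTA text."""
    lines = []
    for idx, (rep, group) in enumerate(clusters[:top_n]):
        lines.append(f">cluster{idx}_size{len(group)}")
        lines.append(rep)
    return "\n".join(lines) + "\n"
```

You would write the output of `representatives_fasta(clusters, top_n=...)` to a file and use that much smaller file as the query for blastn or FastQ Screen. The cluster size in each header tells you roughly how much of the subsample each hit accounts for.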
I see that FastQ Screen is somewhat similar to bbsplit.sh from the BBMap tool set.
I have already tried a few genomes with bbsplit, and now I know that only 0.0002% of the reads are from Verticillium (as an example).
But this approach means that I have to download the genome(s) and rely on luck to pick the genome of the contaminant.
I am looking for an approach similar to a classic metagenomic study, in which you supply the sequences and the program and/or service identifies the source of the contamination for you.