Hi, I have 10000 unassembled contigs from a metagenomic analysis. I have no idea which sequence belongs to which species. I ran Kraken classification but that is not enough as I still have half of the reads unclassified. I have collected the raw material for nanopore sequencing from an infected plant. I do not know what kind of pathogen (virus, fungus, or bacteria) caused it. So, I tried to blast each contig remotely using the following command:
blastn -query filtered.assembly.not.aligned.fasta -remote -db nr -out blastoutput_unassembled.txt -outfmt '6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle' -max_target_seqs 1
But this process has been running for the last 7 days, and results are available for only 3000 contigs so far.
Could you please suggest whether this process can be accelerated, or whether there is an alternative approach for this purpose?
Kraken (or I think Centrifuge?) would be the way to go IMO. They are specifically designed for this task, which BLAST really isn't. Check that you are using the latest database versions etc. I could be wrong, but I would expect their databases to be close to NR in completeness, if not the same.
The most immediate answer to your question, though, is: don't use remote BLAST. Install a local copy.
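For what it's worth, here is a rough sketch of what the local version of the command from the question could look like. It only builds and prints the command; it does not run BLAST. It assumes a nucleotide database named nt has already been downloaded (for example with the update_blastdb.pl script that ships with BLAST+) and that 8 threads are available. The function name, the default database, and the thread count are my own placeholders, not something from the original post.

```python
import shlex

# Same tabular columns as in the question. The name columns (sscinames,
# scomnames) need the NCBI taxdb files next to the local database.
OUTFMT = "6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle"


def local_blastn_command(query, out, db="nt", threads=8):
    """Build a blastn argument list for a local search (no -remote).

    `db` and `threads` are assumptions; adjust them to your own setup.
    """
    return [
        "blastn",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", OUTFMT,
        "-max_target_seqs", "1",
        # -num_threads is only usable for local searches.
        "-num_threads", str(threads),
    ]


cmd = local_blastn_command(
    "filtered.assembly.not.aligned.fasta",
    "blastoutput_unassembled.txt",
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Dropping -remote lets BLAST use several threads on your own machine instead of queueing each query on NCBI's servers, which is usually where most of the waiting time goes.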
Thank you for the suggestion. Do you know of any NCBI database only for bacteria, viruses, and fungi? I have run Kraken and did not get what I was looking for since many of the reads were unclassified.
Which Kraken database did you use? I agree that running a local version of BLAST should dramatically improve runtime, but I think I would alter the approach a bit. If the majority of your reads are not being classified as expected by a tool like Kraken, I would look closely at a handful of the unclassified reads. BLAST these, and look at all the other databases available.
It sounds like you may have a contamination issue. If you used the correct one, the Kraken databases are very good at general classification, so when this many reads go unclassified, it sets off a few alarm bells for me.
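To pull out "a handful" of unclassified reads for a quick manual BLAST, something like the following could work. It is a minimal sketch using only the standard library; the function names and the fixed random seed are my own choices, and it assumes the unclassified reads are already in FASTA format (FASTQ would need converting first).

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n, seed=0):
    """Return up to n records chosen at random (reproducible via seed)."""
    if n >= len(records):
        return list(records)
    return random.Random(seed).sample(records, n)


def to_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


example = ">read1\nACGTAC\n>read2\nGGGGCC\n>read3\nTTTTAA\n"
subset = sample_records(parse_fasta(example), 2)
print(to_fasta(subset), end="")
```

With a real file you would read its contents with open(...).read() and write the output of to_fasta to a new file, then paste or submit that small file to BLAST.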
I used the Standard and Viral database from this source https://benlangmead.github.io/aws-indexes/k2
I have around 120K reads; about 50% of those were classified and the other half remained unclassified.
NCBI databases aren't broken down in that way as such; you have to filter by taxonomic ID numbers.
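As a sketch of what filtering by taxonomic ID could look like with a local BLAST+ install: recent BLAST+ versions accept a -taxids option with a comma-separated list of NCBI taxonomy IDs. The IDs below are the NCBI Taxonomy entries for Bacteria (2), Viruses (10239), and Fungi (4751). The helper name and the query file name are hypothetical. Depending on your BLAST+ version, high-level IDs like these may first need expanding to species-level IDs (BLAST+ ships a get_species_taxids.sh script for that), so treat this as a starting point rather than a guaranteed recipe.

```python
# NCBI Taxonomy IDs for the three groups asked about in the thread.
TAXIDS = {"Bacteria": 2, "Viruses": 10239, "Fungi": 4751}


def taxid_args(taxids):
    """Return blastn arguments restricting a search to the given taxids."""
    ids = sorted({int(t) for t in taxids})
    if not ids:
        raise ValueError("at least one taxonomy ID is required")
    return ["-taxids", ",".join(str(t) for t in ids)]


# "unclassified.fasta" is a placeholder query file name.
cmd = ["blastn", "-query", "unclassified.fasta", "-db", "nt", "-outfmt", "6"]
cmd += taxid_args(TAXIDS.values())
print(" ".join(cmd))
```

The printed command can be combined with whatever other blastn options you are already using.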