Dear all,
I have finalised a de novo assembly of an uncharacterised organism. To check for contamination from bacteria etc., we would like to search it against the BLAST nt database.
When I run BLAST with default settings, I simply get 33 GB of output, which I don't need. My aim is to get a single percentage summary, such as
Bacteria -> 33% similarity.
I have also tried a megablast run with default settings. That did not help either.
Is this kind of summarised result possible?
Thank you very much for the help,
Best regards,
Tunc.
Try Kraken; it's faster and can generate the kind of output you are asking for.
In what form/format is this assembly? Contigs? How many of them? Why are you getting 33G of data? How are you running your BLAST?
The assembly is ~900 MB in size (in FASTA). The length is 0.9 Gbp.
I tried;
Make your blast more stringent. Look for long alignments since most of the short ones are likely going to be spurious/by chance.
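As a rough illustration of "look for long alignments", here is a minimal Python sketch that filters BLAST tabular output after the fact. It assumes the default `-outfmt 6` column order; the function name `filter_hits` and the thresholds (500 bp, 90% identity, e-value 1e-25) are arbitrary placeholders of mine, not recommended values, so tune them to your data.

```python
import csv
import io

# Column order of BLAST's default tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_length=500, min_pident=90.0, max_evalue=1e-25):
    """Yield hits (as dicts) that pass all three thresholds.

    The default thresholds are illustrative assumptions only.
    """
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if (int(row["length"]) >= min_length
                and float(row["pident"]) >= min_pident
                and float(row["evalue"]) <= max_evalue):
            yield row


if __name__ == "__main__":
    # Two made-up hits: one long and near-identical, one short and weak.
    example = (
        "contig1\tsubjA\t98.5\t1200\t18\t0\t1\t1200\t5\t1204\t0.0\t2100\n"
        "contig2\tsubjB\t85.0\t60\t9\t0\t10\t69\t300\t359\t1e-05\t52.8\n"
    )
    for hit in filter_hits(io.StringIO(example)):
        print(hit["qseqid"], hit["sseqid"], hit["length"])
```

In practice you would pass an open file handle on your BLAST output instead of the in-memory example. Filtering at search time (with `-evalue` and similar options) keeps the output small in the first place; this post-filter is only useful if you already have the large file.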
That said, you are on a slippery slope here. If contamination was suspected in your original data, it should have been taken care of before doing the assembly. Now you cannot be sure whether some of the sequence in your assembly should not be there in the first place.
Yes, you are totally right about taking precautions beforehand. For now, I am just doing this to find the most obvious contaminants.
I don't want to mess with the default word size and mismatch levels. Is there a way to make this tuning simpler?
Check the BlobTools docs for a suggestion of BLAST parameters. In particular, set a stringent e-value cutoff, use a custom output format, and restrict the number of target sequences and HSPs returned (there is a somewhat hidden side-effect of -max_target_seqs, though):
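To make this concrete, here is a hedged Python sketch of that kind of run plus the per-kingdom percentage summary asked for at the top of the thread. It is not a copy of the BlobTools recommendation: the file names, thread count, e-value, the `sskingdoms` column (which needs the NCBI taxdb files installed next to the database) and the helper names `build_blastn_command` and `summarise_by_kingdom` are all my assumptions. Because of the -max_target_seqs side effect mentioned above, the sketch asks for 10 targets and picks the best hit per contig by bitscore itself.

```python
from collections import Counter

# Custom tabular format: 3 leading columns, the 12 standard columns ("std"),
# then the subject super kingdom as the last (16th) column.
OUTFMT = "6 qseqid staxids bitscore std sskingdoms"


def build_blastn_command(query, db="nt", out="assembly.vs.nt.tsv", threads=8):
    """Return a blastn argument list (all values are placeholder assumptions)."""
    return [
        "blastn", "-task", "megablast",
        "-query", query, "-db", db,
        "-outfmt", OUTFMT,
        "-max_target_seqs", "10",   # not 1: best hit is chosen below instead
        "-max_hsps", "1",
        "-evalue", "1e-25",
        "-num_threads", str(threads),
        "-out", out,
    ]


def summarise_by_kingdom(lines, n_contigs):
    """Return {kingdom: percent of contigs}, using each contig's best hit.

    Contigs without any hit are reported under "no hit". Percentages are
    by contig count, not by bases.
    """
    best = {}  # qseqid -> (bitscore, kingdom)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 16:
            continue  # skip malformed or truncated rows
        qseqid, bitscore, kingdom = fields[0], float(fields[2]), fields[-1]
        if qseqid not in best or bitscore > best[qseqid][0]:
            best[qseqid] = (bitscore, kingdom)
    counts = Counter(kingdom for _, kingdom in best.values())
    counts["no hit"] = n_contigs - len(best)
    return {k: 100.0 * c / n_contigs for k, c in counts.items()}


if __name__ == "__main__":
    # Print the command; run it yourself, e.g. via subprocess.run(cmd, check=True).
    print(" ".join(build_blastn_command("assembly.fasta")))
```

After the search finishes, something like `summarise_by_kingdom(open("assembly.vs.nt.tsv"), n_contigs)` gives one percentage per kingdom, where `n_contigs` is the number of sequences in your FASTA. Weighting by contig length instead of contig count would be a straightforward extension.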