3.4 years ago
jahanshahi.amin
Hi everyone, I'm running blastn with the parameters below. Because I denoise the output in several steps, I need more sequences in the output. Which parameters can I add or change to increase the number of output sequences?
blastn -query input.fasta -db ref -out BlastResults.txt -num_threads 9 -outfmt '6 qseqid sseqid sseq' -word_size 6
So you want to artificially inflate the number of inputs going into your "denoising" workflow?
Could you please clarify?
Thank you for the response. I have a defined pipeline in which the remaining sequences are filtered in several steps. For example, in the first step, translated sequences shorter than n are removed. In the next step, sequences whose length is not a multiple of m are removed, and so on. Since the number of sequences remaining at the end is low, I need more sequences before the filtering (denoising) starts. So I'm asking how to increase the number of initial sequences by changing blastn parameter values. Thanks.
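The two filtering steps described above could be sketched roughly as follows. This is only an illustration under assumptions: it applies both rules to the third column (sseq) of the tab-separated BLAST output rather than to translated sequences, it ignores alignment gap characters when measuring length, and `filter_by_length` is a hypothetical helper name. The actual values of n and m are not given in the thread.

```shell
# filter_by_length N M
# Reads tab-separated rows on stdin and keeps those whose column-3 sequence
# has length >= N and a length that is a multiple of M.
# Assumption: column 3 holds the sequence (as in -outfmt '6 qseqid sseqid sseq').
filter_by_length() {
  awk -F '\t' -v n="$1" -v m="$2" '
    {
      seq = $3
      gsub(/-/, "", seq)   # drop alignment gap characters before measuring
      len = length(seq)
      if (len >= n && len % m == 0) print
    }'
}
```

It could be used as `filter_by_length 30 3 < BlastResults.txt > filtered.tsv`, where 30 and 3 are placeholder values for n and m. Because each row is kept or dropped on its own, the fraction that survives does not depend on how many rows go in.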
To me it doesn't sound like any of these steps are dependent on the number of sequences passing through them.
If you intentionally manipulate sequence search results to boost the number of matches, you'll probably just end up with a lot of false positives (that might get filtered out by your pipeline anyway).
It's entirely possible you could obtain more matches by adjusting the sequence coverage, identity, and bitscore/e-value thresholds. But this depends.
Could you perhaps elaborate on what your workflow is actually trying to achieve?
The denoising process leads to finding the best hits in the end. We are not going to manipulate the sequences. We are looking for the right parameters, settings, reference points, etc. to make sure we collect all the meaningful sequences. We have tens of millions of NGS sequences and should end up with tens of thousands of amino acid sequences.
I didn't imply that you will be manipulating the sequences themselves, but that you'd be manipulating the output of BLAST in terms of the number of matched sequences. But I digress.
I am not sure how you're defining "meaningful" here. If you just want a very broad set of matches, you should probably impose a very permissive (i.e. high) e-value cut-off (like 10) and lower the coverage and sequence identity requirements as far as possible.
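A broad search along those lines might look like the sketch below. It assumes BLAST+ is installed. `relaxed_blastn` is a hypothetical wrapper name, and the values chosen (in particular 5000 target sequences) are illustrative, not recommendations. The extra output columns are added so that the downstream steps can filter on e-value, identity, and coverage instead of relying on BLAST to do so.

```shell
# relaxed_blastn QUERY DB OUT
# Hypothetical wrapper around blastn with permissive settings.
relaxed_blastn() {
  # -task blastn: the blastn program defaults to the megablast task, which is
  #   tuned for highly similar sequences; the blastn task is more sensitive.
  # -evalue 10: permissive cut-off (this is also the BLAST+ default); raise it
  #   to admit even weaker hits.
  # -max_target_seqs: cap on subject sequences reported per query.
  # -perc_identity and -qcov_hsp_perc are deliberately left unset so that no
  #   identity or coverage threshold is imposed at search time.
  blastn -query "$1" -db "$2" -out "$3" \
    -num_threads 9 \
    -task blastn -word_size 6 \
    -evalue 10 \
    -max_target_seqs 5000 \
    -outfmt '6 qseqid sseqid pident length evalue bitscore qcovhsp sseq'
}
```

With the file names from the original command, this would be called as `relaxed_blastn input.fasta ref BlastResults.txt`. Expect many weak or spurious hits in the output; the point is only to defer the thresholds to the filtering pipeline.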
With that many sequences, you shouldn't be using BLAST anyway. I'd suggest looking into MMseqs2. It might be worth taking a look at Plass as well, since you mention amino acid sequences from NGS data.