Question

How to increase the number of matched sequences in blastn

3.4 years ago

jahanshahi.amin • 0

Hi everyone, I'm running blastn with the parameters shown below. Because I denoise the output data in several steps, I need a larger number of sequences in the output. Which parameters can I add or change to increase the number of output sequences?

 blastn -query input.fasta -db ref -out BlastResults.txt -num_threads 9 -outfmt '6 qseqid sseqid sseq' -word_size 6

mismatch blastn • 1.4k views

ADD COMMENT • link updated 3.4 years ago by Dunois ★ 2.8k • written 3.4 years ago by jahanshahi.amin • 0

I denoise the output data in several steps, I need a higher number of sequences as the output.

So you want to artificially inflate the number of inputs going into your "denoising" workflow?

Could you please clarify?

ADD REPLY • link 3.4 years ago by Dunois ★ 2.8k

Thank you for the response. I have a defined platform in which the remaining sequences are filtered in several steps. For example, in the first step, translated sequences shorter than n are removed. In the next step, sequences whose length is not a multiple of m are removed, and so on. Because there are several filtering steps and the number of sequences remaining at the end is low, I need a larger number of sequences before the filtering (denoising) process starts. So I'm asking how to increase the number of initial sequences by changing blastn parameter values. Thanks.
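To make the two example filters concrete, here is a minimal sketch that applies them to the tabular output of the command above. It assumes the checks are run on the third column (sseq) of '6 qseqid sseqid sseq', which may not match the real pipeline (the translation step is left out); filter_hits is a hypothetical helper name, and n and m are the placeholders from the post.

```shell
# Hypothetical helper: keep only rows whose subject sequence (column 3)
# is at least n long and whose length is a multiple of m.
# Usage: filter_hits N M < BlastResults.txt
filter_hits() {
  awk -F '\t' -v n="$1" -v m="$2" '
    {
      seq = $3
      gsub(/-/, "", seq)   # sseq is the aligned part of the subject, so gaps are stripped first
      len = length(seq)
      if (len >= n && len % m == 0) print
    }'
}

# Toy example with two made-up rows: with n=6 and m=3 only the q1 row is kept.
printf 'q1\ts1\tATGGCC\nq2\ts2\tATG-GC\n' | filter_hits 6 3
```

Counting the rows before and after each such step (for example with wc -l) shows which filter removes most of the sequences.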

ADD REPLY • link 3.4 years ago by jahanshahi.amin • 0

To me it doesn't sound like any of these steps are dependent on the number of sequences passing through them.

If you intentionally manipulate sequence search results to boost the number of matches, you'll probably just end up with a lot of false positives (that might get filtered out by your pipeline anyway).

It's entirely possible you could obtain more matches by relaxing the sequence coverage, identity, and bitscore/e-value thresholds, but whether those extra matches are useful depends on what you are trying to do.

Could you perhaps elaborate on what your workflow is actually trying to achieve?

ADD REPLY • link 3.4 years ago by Dunois ★ 2.8k

The denoising process leads to finding the best hits in the end. We are not going to manipulate the sequences. We are looking for the right parameters, settings, reference points, etc. to make sure we collect all the meaningful sequences. We have tens of millions of NGS sequences, and we should end up with tens of thousands of amino acid sequences.

ADD REPLY • link 3.4 years ago by jahanshahi.amin • 0

I didn't imply that you will be manipulating the sequences themselves, but that you'd be manipulating the output of BLAST in terms of number of matched sequences. But I digress.

I am not sure how you're defining "meaningful" here. If you just want a very broad set of matches, you should probably impose a very lenient (i.e., high) e-value cut-off (like 10) and lower the coverage and sequence identity requirements as far as possible.
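As a rough sketch of what such a permissive run could look like, the command from the question can be extended with explicit thresholds. The output file name and the extra output columns are assumptions for illustration. As far as I know, -evalue already defaults to 10 in blastn, and setting -perc_identity and -qcov_hsp_perc to 0 simply means no identity or coverage filter is applied, so this mostly makes the defaults explicit.

```shell
# Illustrative values only, not recommendations.
EVALUE=10          # permissive cut-off; also the blastn default
PERC_IDENTITY=0    # 0 = no minimum percent identity
QCOV_HSP_PERC=0    # 0 = no minimum query coverage per HSP

# Only run when blastn and the query file are actually present.
if command -v blastn >/dev/null 2>&1 && [ -f input.fasta ]; then
  blastn -query input.fasta -db ref -out BlastResults.permissive.txt \
    -num_threads 9 -word_size 6 \
    -evalue "$EVALUE" -perc_identity "$PERC_IDENTITY" -qcov_hsp_perc "$QCOV_HSP_PERC" \
    -outfmt '6 qseqid sseqid pident length qcovhsp evalue bitscore sseq'
else
  echo "blastn or input.fasta not found; nothing was run" >&2
fi
```

Writing pident, qcovhsp, evalue and bitscore into the table means stricter thresholds can be applied afterwards without re-running the search. Note that sseq moves from column 3 to column 8 with this format string.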

With that many sequences, you shouldn't be using BLAST anyway. I'd suggest looking into MMseqs2. It might also be worth taking a look at PLASS, since you mention amino acid sequences from NGS data.

ADD REPLY • link 3.4 years ago by Dunois ★ 2.8k

score 0 · Answer 1 · 2021-06-14

3.4 years ago

JC 13k

blastn has a -max_target_seqs flag, and its default is 500. However, as you are using default parameters, I believe you are already getting the maximal number of hits. If you want to check other settings, see https://www.ncbi.nlm.nih.gov/books/NBK279684/
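One way to check whether that cap is actually limiting the output is to count the distinct subjects reported per query and see whether any query reaches it. The sketch below assumes tab-separated output with qseqid and sseqid in the first two columns, as in the command from the question; queries_at_cap is a hypothetical helper name.

```shell
# Hypothetical helper: list queries whose number of distinct subject IDs
# reaches the cap given as $1. Rows are read from stdin.
queries_at_cap() {
  awk -F '\t' -v cap="$1" '
    !seen[$1 FS $2]++ { n[$1]++ }   # count each query/subject pair once
    END { for (q in n) if (n[q] >= cap) print q "\t" n[q] }'
}

# Toy example with a cap of 2: q1 has two distinct subjects, q2 has one,
# so only q1 is reported.
printf 'q1\ts1\tACGT\nq1\ts2\tACGT\nq1\ts2\tAC\nq2\ts1\tACGT\n' | queries_at_cap 2

# On real output this would be: queries_at_cap 500 < BlastResults.txt
```

If no query is listed at 500, raising -max_target_seqs is unlikely to add hits. If some are, re-running with a larger value (for example -max_target_seqs 5000) may be worth trying.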