Is this still your previous question, framed in a different way again? Did you read and understand what was commented there? Amino Acid identity matrix help
Well, it looks like it is about the same 4000 proteins. Anyway, why don't you simply concatenate all sequences into one big BLAST database and then blast it against itself? Then you get the complete output in one run. No need for loops here.
Yes, because I am working on the same dataset, but my question is different here. Here I am asking how to loop the command that I want to perform.
nextflow run biostar9523782.nf -resume --qdir "/path/to/QUERYDIR" --tdir "/path/to/TARGETDIR"
You can tune this parameter too, and BLAST will also obey values >500 (just tested, there is no hard limit or anything). However, that requires that BLAST has found at least one HSP. No alignments will be computed if that is not the case. In these cases, there may be fewer than 4000 sequences reported. In other cases, multiple hits may be reported for a single pair of sequences, e.g. if several parts of one sequence align to the other sequence. In any case, the estimate of sequence identity/similarity obtained from BLAST is only valid for that local aligned region and not for the complete sequences (that is what local alignment means).
What you are trying to do is not optimal. Or to put it bluntly, it is a massive waste of time. Michael Dondrup already told you an efficient (and proper) way of how all-vs-all searches are done. For the rest:
blastp -help
It will show you that you can customize the number of descriptions in the output (-num_descriptions) and the alignments (-num_alignments). Simply put an arbitrarily large number after those switches, and then only -evalue will be the limiting factor for the number of hits shown.
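To make the single-run suggestion concrete, here is a minimal bash sketch of concatenating the FASTA files, building one database and searching it against itself. The folder name, the output prefix and the cap of 5000 hits are placeholders I made up, not anything from the original post. Note that with tabular output (-outfmt 6) the hit cap is set with -max_target_seqs rather than -num_descriptions/-num_alignments.

```shell
#!/usr/bin/env bash
# Sketch only: the folder, the output prefix and the default hit cap
# are assumptions for illustration, not values from the thread.

# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Concatenate every FASTA in a folder, build one protein database,
# then search that database with the same sequences.
all_vs_all() {
  local fasta_dir="$1"
  local prefix="$2"
  local max_hits="${3:-5000}"

  cat "$fasta_dir"/*.fasta > "$prefix.fasta"
  run makeblastdb -in "$prefix.fasta" -dbtype prot -out "${prefix}_db"
  # Tabular output: cap hits per query with -max_target_seqs.
  run blastp -query "$prefix.fasta" -db "${prefix}_db" \
    -outfmt 6 -max_target_seqs "$max_hits" \
    -out "${prefix}_vs_self.tsv"
}
```

Something like `DRY_RUN=1 all_vs_all folder1 all 4000` prints the two commands without running them, which is a cheap way to check the paths first. Each input file should end with a newline, otherwise cat will glue a header onto the previous sequence.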
Yes, if indeed you have 4000 BLAST databases in folder2, one for each FASTA in folder1, you are creating too much work for yourself, since you just want to blast a set of sequences against itself....
....so you just want to use blastp's "BLAST-2-Sequences options"...
For this you don't need to create a BLAST database at all!
Oh, and in this mode blastp can work with streams, so you don't need to create a file with them all combined.
Assuming your shell is bash, try:
blastp -query <(cat folder1/*.fasta) -subject <(cat folder1/*.fasta) -out test
No for loops needed. And in the future, forget loops for things like this. Install and learn to use GNU Parallel instead!
Have you thought of using two nested for loops? The outer loop goes through the samples one at a time. The inner one does the same for the BLAST DBs.
Hi, no, I have not tried that. Can you please give me an example?
Thank you
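Since an example was asked for, here is a rough bash sketch of the two nested loops. It assumes one protein database per sequence in the second folder, each identifiable by a `<name>.pin` index file. The function name, the directory arguments and the output naming are all made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the DB folder holds one protein database per
# sequence, each with a <name>.pin index file. All names are placeholders.

# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

blast_each_pair() {
  local qdir="$1"
  local dbdir="$2"
  local outdir="$3"
  local query pin qname db dbname

  mkdir -p "$outdir"
  for query in "$qdir"/*.fasta; do        # outer loop: one query file at a time
    qname=$(basename "$query" .fasta)
    for pin in "$dbdir"/*.pin; do         # inner loop: one database at a time
      db="${pin%.pin}"                    # blastp takes the DB prefix, not a file name
      dbname=$(basename "$db")
      run blastp -query "$query" -db "$db" -outfmt 6 \
        -out "$outdir/${qname}_vs_${dbname}.tsv"
    done
  done
}
```

It would be called as `blast_each_pair folder1 folder2 results`. Keep in mind that 4000 queries against 4000 databases means 16 million separate blastp runs, which is why the single-run approaches in the other answers are the better choice.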
No, it's not the same question; it is a different query.
Ok, no need for a loop...
If I run the entire protein FASTA against the entire DB, it will give me only the top 500 hits. There is a limitation. I want all the hits, not just the top 500.
What does this mean?
I guess OP means the three files you get when you make a BLAST database from a FASTA file:
Yes, these three for each.