Shell scripting looping help
4
0
Entering edit mode
2.6 years ago

Hello everyone,

I have two different folders,

  1. in folder1 I have 4000 fasta sequences.
  2. in folder2 I have blast database for each sequence. it means 3 files for each .

I want to run blastp for each fasta in folder1 with each database in folder2.

I tried it with a for loop and if loop. It did not work for me .

if loop example given below ,

#!/bin/bash 
blastp -query path to folder1/"$1".fasta -db path to folder2/"$1"_db -out test

This gives me the result for fasta 1 with database 1.

I want the result as fasta 1 with all databases then fasta 2 with all databases and so on.

What can be a better way?

Linux shell-scripting • 2.5k views
ADD COMMENT
0
Entering edit mode

I want the result as fasta 1 with all databases then fasta 2 with all databases and so on

Have you thought of using two nested for loops? Outer loop goes through samples one at a time. Inner one does the same for blast DB.

ADD REPLY
0
Entering edit mode

Hi, NO I have not tried that . Can you please give me an example for the same .

Thank you

ADD REPLY
0
Entering edit mode

Is this still your previous question framed in a different way again? Did you read and understand what was comented there? Amino Acid identity matrix help

ADD REPLY
0
Entering edit mode

NO its not the same question it is a different query itself .

ADD REPLY
0
Entering edit mode

well, it looks it is about the same 4000 proteins. Anyway, why don't you simply concatenate all sequences into one big blast database and then blast it against itself? Then you get the complete output in one run. No need for loops here.

ADD REPLY
0
Entering edit mode

Yes , because I am working on the same dataset but my question is different here . Here I am asking for how to loop the command that I want to perform .

ADD REPLY
0
Entering edit mode

Ok, no need for a loop...

ADD REPLY
0
Entering edit mode

if I run the entire protein fasta with entire db it will give me top 500 hits . there is a limitation . I want all the hits not just top 500

ADD REPLY
0
Entering edit mode
in folder2 I have blast database for each sequence. it means 3 files for each .

What does this mean?

ADD REPLY
0
Entering edit mode

I guess op means the three filed you get when you make a blast database from a fasta file:

        fourk.faa.phr   fourk.faa.pin   fourk.faa.psq
ADD REPLY
0
Entering edit mode

yes this three for each

ADD REPLY
1
Entering edit mode
2.6 years ago

let's do it with nextflow.

invoke with

nextflow run biostar9523782.nf -resume --qdir "/path/to/QUERYDIR" --tdir "/path/to/TARGETDIR"

ADD COMMENT
0
Entering edit mode
2.6 years ago
Michael 55k

Simple trick here, assuming your working directory contains all fasta files:

cat *.fasta > combined.faa
makeblastdb -in combined.faa -dbtype prot -parse_sequids
blastp -db combined.faa -query combined.faa -out allvsall.blast.txt -outfmt 7 -max_target_seqs 4000
ADD COMMENT
0
Entering edit mode

yes, if I run the entire protein fasta with entire db it will give me top 500 hits . there is a limitation . I want all the hits not just top 500

ADD REPLY
0
Entering edit mode

You can tune this parameter too and Blast will also obey values >500 (just tested, there is no hard limit or anything). However, that requires that blast has found at least one HSP. No alignments will be computed if that is not the case. In these cases, there may be less than 4000 sequences reported. In other cases, multiple hits may be reported for a single pair of sequences, e.g. if several parts of one sequence align to the other sequence. In any case, the estimate of sequence identity/similarity obtained from blast is only valid for that local aligned region and not the complete sequences (that is what local alignment means).

ADD REPLY
0
Entering edit mode
2.6 years ago
Mensur Dlakic ★ 28k

What you are trying to do is not an optimal way. Or to put it bluntly, it is a massive waste of time. Michael Dondrup already told you an efficient (and proper) way of how all-vs-all searches are done. For the rest:

blastp -help

It will show you that you can customize the number of descriptions in the output (-num_descriptions) and the alignments (-num_alignments). Simply put an arbitrarily large number after those switches, and then only -evalue will be the limiting factor for the number of hits shown.

ADD COMMENT
0
Entering edit mode
2.6 years ago
Malcolm.Cook ★ 1.5k

yes, if indeed you have 4000 blast databases in folder2, one for each fasta in folder1, you are creating too much work for yourself since you just want to blast a set of sequences against itself....

....so you just want to use blastps "BLAST-2-Sequences options"...

for this you don't need to create a blast database at all!

Oh, and in this mode blastp can work with streams you don't need to create a file with them all combined.

Assuming your shell is bash, try:

blastp -query <(cat folder1/*.fasta) -subject  <(cat folder1/*.fasta) -out test

no for loops needed. But in the future, forget loops for things like this. Install and learn to use GNU Parallel instead!

ADD COMMENT

Login before adding your answer.

Traffic: 2192 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6