Shell scripting looping help

0

Entering edit mode

3.1 years ago

Confused_human ▴ 30

Hello everyone,

I have two different folders,

in folder1 I have 4000 fasta sequences.
in folder2 I have blast database for each sequence. it means 3 files for each .

I want to run blastp for each fasta in folder1 with each database in folder2.

I tried it with a for loop and if loop. It did not work for me .

if loop example given below ,

#!/bin/bash 
blastp -query path to folder1/"$1".fasta -db path to folder2/"$1"_db -out test

This gives me the result for fasta 1 with database 1.

I want the result as fasta 1 with all databases then fasta 2 with all databases and so on.

What can be a better way?

Linux shell-scripting • 3.3k views

ADD COMMENT • link updated 3.0 years ago by Malcolm.Cook ★ 1.5k • written 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

I want the result as fasta 1 with all databases then fasta 2 with all databases and so on

Have you thought of using two nested for loops? Outer loop goes through samples one at a time. Inner one does the same for blast DB.

ADD REPLY • link 3.1 years ago by GenoMax 151k

0

Entering edit mode

Hi, NO I have not tried that . Can you please give me an example for the same .

Thank you

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

Is this still your previous question framed in a different way again? Did you read and understand what was comented there? Amino Acid identity matrix help

ADD REPLY • link 3.1 years ago by Michael 55k

0

Entering edit mode

NO its not the same question it is a different query itself .

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

well, it looks it is about the same 4000 proteins. Anyway, why don't you simply concatenate all sequences into one big blast database and then blast it against itself? Then you get the complete output in one run. No need for loops here.

ADD REPLY • link 3.1 years ago by Michael 55k

0

Entering edit mode

Yes , because I am working on the same dataset but my question is different here . Here I am asking for how to loop the command that I want to perform .

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

Ok, no need for a loop...

ADD REPLY • link 3.1 years ago by Michael 55k

0

Entering edit mode

if I run the entire protein fasta with entire db it will give me top 500 hits . there is a limitation . I want all the hits not just top 500

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

in folder2 I have blast database for each sequence. it means 3 files for each .

What does this mean?

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

Entering edit mode

I guess op means the three filed you get when you make a blast database from a fasta file:

        fourk.faa.phr   fourk.faa.pin   fourk.faa.psq

ADD REPLY • link 3.1 years ago by Michael 55k

0

Entering edit mode

yes this three for each

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

1

Entering edit mode

3.1 years ago

Pierre Lindenbaum 166k

let's do it with nextflow.

invoke with

nextflow run biostar9523782.nf -resume --qdir "/path/to/QUERYDIR" --tdir "/path/to/TARGETDIR"

	nextflow.enable.dsl=2

	/* full path to query directory */
	params.qdir="/DIR1"

	/* full path to target/database directory */
	params.tdir="/DIR2"


	workflow {
	dir_ch = Channel.of(["query",params.qdir],["target",params.tdir])
	fasta_ch = findFasta(dir_ch).splitCsv(header: false,sep:'\t',strip:true)
	query_ch = fasta_ch.filter{T->T[0].equals("query")}.map{T->T[1]}
	target_ch = fasta_ch.filter{T->T[0].equals("target")}.map{T->T[1]}
	blast_ch = blastIt(query_ch.combine(target_ch))
	zipIt(blast_ch.collect())

	}

	process findFasta {
	tag "${type} in ${dir}"
	input:
	tuple val(type),val(dir)
	output:
	path("paths.txt")
	script:
	"""
	test -d "${dir}"
	find "${dir}" -type f -name ".fa" -o -name ".fasta" \|\
	sed 's/^/${type}\t/' > paths.txt
	test -s paths.txt
	"""
	}

	process blastIt {
	tag "${query} / ${target}"
	input:
	tuple val(query),val(target)
	output:
	path("out.blast")
	script:
	"""
	blastp -query "${query}" -db ${target} -out out.blast
	"""
	}

	process zipIt {
	input:
	val(L)
	output:
	path("blast.zip")
	script:
	"""
	zip -9 blast.zip ${L.join(" ")}
	"""
	}

view raw biostar9523782.nf hosted with ❤ by GitHub

ADD COMMENT • link 3.1 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

3.1 years ago

Michael 55k

Simple trick here, assuming your working directory contains all fasta files:

cat *.fasta > combined.faa
makeblastdb -in combined.faa -dbtype prot -parse_sequids
blastp -db combined.faa -query combined.faa -out allvsall.blast.txt -outfmt 7 -max_target_seqs 4000

ADD COMMENT • link 3.1 years ago by Michael 55k

0

Entering edit mode

yes, if I run the entire protein fasta with entire db it will give me top 500 hits . there is a limitation . I want all the hits not just top 500

ADD REPLY • link 3.1 years ago by Confused_human ▴ 30

0

Entering edit mode

You can tune this parameter too and Blast will also obey values >500 (just tested, there is no hard limit or anything). However, that requires that blast has found at least one HSP. No alignments will be computed if that is not the case. In these cases, there may be less than 4000 sequences reported. In other cases, multiple hits may be reported for a single pair of sequences, e.g. if several parts of one sequence align to the other sequence. In any case, the estimate of sequence identity/similarity obtained from blast is only valid for that local aligned region and not the complete sequences (that is what local alignment means).

ADD REPLY • link 3.1 years ago by Michael 55k

0

Entering edit mode

3.1 years ago

Mensur Dlakic ★ 29k

What you are trying to do is not an optimal way. Or to put it bluntly, it is a massive waste of time. Michael Dondrup already told you an efficient (and proper) way of how all-vs-all searches are done. For the rest:

blastp -help

It will show you that you can customize the number of descriptions in the output (-num_descriptions) and the alignments (-num_alignments). Simply put an arbitrarily large number after those switches, and then only -evalue will be the limiting factor for the number of hits shown.

ADD COMMENT • link 3.1 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

3.0 years ago

Malcolm.Cook ★ 1.5k

yes, if indeed you have 4000 blast databases in folder2, one for each fasta in folder1, you are creating too much work for yourself since you just want to blast a set of sequences against itself....

....so you just want to use blastps "BLAST-2-Sequences options"...

for this you don't need to create a blast database at all!

Oh, and in this mode blastp can work with streams you don't need to create a file with them all combined.

Assuming your shell is bash, try:

blastp -query <(cat folder1/*.fasta) -subject  <(cat folder1/*.fasta) -out test

no for loops needed. But in the future, forget loops for things like this. Install and learn to use GNU Parallel instead!

ADD COMMENT • link 3.0 years ago by Malcolm.Cook ★ 1.5k

Login before adding your answer.