Question

Blast locally on multiple fasta files with multiple database

0

Entering edit mode

5.5 years ago

fec2 ▴ 50

Hi all,

I need to run blast locally on multiple fasta files contain in a directory. Previously, I have used command below:

ls *.fasta | parallel -a - blastp -query {} -db my_database -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {.}.tsv

Since I need to do pairwise comparison and need to blast all fasta files against each other, the command above will need to performe multiple time because I have to specify my database. Is it possible to have a command that is able to blast all fasta file against each other? And generate output file with file name that combined the name of my database plus the query file name?

Thank you.

sequence genome • 5.7k views

ADD COMMENT • link 3.2 years ago by fec2 ▴ 50

0

Entering edit mode

Small nitpick but

Since I need to do pairwise comparison and need to blast all fasta files against each other

does not match the title of this post.

Blast locally on multiple fasta files with multiple database

May want to edit one or other.

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Entering edit mode

the command above will need to performed multiple time because I have to specify my database
And generate output file name that combined of the name of my database
-db mydatabase

ADD REPLY • link 5.5 years ago by AK ★ 2.2k

0

Entering edit mode

Hi,

Thank you. Any suggestion for me to change the title of the post?

ADD REPLY • link 5.5 years ago by fec2 ▴ 50

0

Entering edit mode

Thanks all, the command work well!

ADD REPLY • link 5.5 years ago by fec2 ▴ 50

score 2 · Accepted Answer · 2019-05-30

2

Entering edit mode

5.5 years ago

AK ★ 2.2k

Something like this:

#!/bin/bash
set -euo pipefail

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      # Before blastp, check if exist or create database for ${s} (I leave this to you)
      blastp -query ${q} -db ${s} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out ${q}_2_${s}.tsv
    fi
  done
done

ADD COMMENT • link 5.5 years ago by AK ★ 2.2k

2

Entering edit mode

This is NOT doing pairwise blast comparisons. One would need to use

blastp -query ${q} -subject ${s} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out ${q}_2_${s}.tsv

with fasta format files.

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Entering edit mode

Thanks, genomx for introducing the -subject argument. Will blastp -query ${q} -subject ${s} build a db for ${s} each time? Or it's without the use of db (how's tradeoff for big fasta file?)

Because as I commented in the response:

Before blastp, check if exist or create database for ${s} (I leave this to you)

which assumed there is a db with prefix ${s}, or build the db if not exist, only the first time.

ADD REPLY • link 5.5 years ago by AK ★ 2.2k

1

Entering edit mode

Other people already asked about possible differences between -subject and -db:

Different results with BLAST depending on if subject is formatted database or FASTA-file

blast - difference between database and subject

The first post indicates it might result in very different output due to how blast parses the subject in each case, though I would like to see confirmation before believing (and I am feeling lazy to test this today, maybe another day).

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

-subject allows direct comparisons of two sequence files against each other. No database needed.

ADD REPLY • link 5.5 years ago by GenoMax 147k

score 1 · Accepted Answer · 2019-05-30

1

Entering edit mode

5.5 years ago

h.mon 35k

I will borrow from SMK loop above to provide a GNU parallel solution.

First save the all vs all filenames into a file:

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      echo "${q} ${s}" >> allpairs.txt
    fi
  done
done

Then cat the filenames into parallel to build the multiple blast commands:

cat pairs.txt | parallel --colsep ' ' -j 1 \
    'blastp -query {1} -subject {2} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {1.}_vs_{2.}.tsv'

If you have multiple ~~processors~~ cpu cores, you can use -j n to perform n blast searches simultaneously. You don't need to build blast databases, blast can perform fasta vs fasta comparison with -query file1 -subject file2 , but in case each fasta file is large, then it would be best if you build the databases beforehand, and replace -subject {2} by -db {2.}.

edits / comments:

in view of my comment above, you should check if indeed -subject and -db behave the same way or not.
with older blast versions, scaling wasn't very good and it was faster to perform n parallel searches than using -num_threads n. This may have improved in recent blast versions (current is BLAST+ 2.9.0), but I didn't test.
keep in mind if you run several blast searches simultaneously with GNU parallel, memory usage will increase correspondingly compared to just one blast search.

ADD COMMENT • link 5.5 years ago by h.mon 35k

0

Entering edit mode

Tried a query against the human genome. Was pretty quick. So does not look like OP will need to build databases.

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Entering edit mode

Hi, is it possible to do reverse search using the to match with same BLAST cut-off?

ADD REPLY • link 3.2 years ago by fec2 ▴ 50

0

Entering edit mode

If you switch

echo "${q} ${s}" >> allpairs.txt

to

echo "${s} ${q}" >> allpairs_rev.txt

that should get you the reverse comparisons.

ADD REPLY • link 3.2 years ago by GenoMax 147k

0

Entering edit mode

Actually I need to do an analysis on 422 genome, but system in my lab can only allow the job to run for 3 days. Therefore, I can't use the other tool online (all of them using phyton and I can't understand the script). Therefore, I wish to use script similar to above (I edited the allpairs.txt file and continue for another 3 days).

Basically they perform BLAST from the query genome against the reference genome with cut-offs of at least 30% identity and at least 70% coverage. Then they took the top match and performed the reverse search using BLAST with the same cut-offs. I don't understand how they use the top match and performed the reverse search.

ADD REPLY • link 3.2 years ago by fec2 ▴ 50

0

Entering edit mode

If I understand your question, one does not take just the top match and perform the reverse search. Idea behind a reciprocal blast search is that homologs will show up as (one of) top hits, no matter which direction one does the search in. That confirms their evolutionary relatedness.

ADD REPLY • link 3.2 years ago by GenoMax 147k

0

Entering edit mode

Yes, in the website of the online tool, they did mention about reciprocal best hits between two genomic datasets of proteins. Is that mean that I can create a allpairs_rev.txt, and average 2 values?

ADD REPLY • link 3.2 years ago by fec2 ▴ 50

0

Entering edit mode

I am not sure what two values you are referring to. What exactly are you trying to do?

ADD REPLY • link 3.2 years ago by GenoMax 147k

0

Entering edit mode

I have to do an analysis called average nucleotide identity (AAI). There are many tool available, for example compareM that I previously used. However, this time I need to analyse 422 genome, but my system in the lab can only allows a job to run for 3 days. I have tried a few tools but non of these tools can finish the job in 3 days. So I am thinking to do it manually using some commands like in this post. The command in this post is for another analysis called percentage of conserved protein. Because I can edit the allpairs.txt file and continue the BLAST after 3 days, so I able to do it manually and continuously until the BLAST job done for all genome. But for AAI, I dont really understand the meaning of reciprocal best hits between two genomic datasets of proteins.

ADD REPLY • link 3.2 years ago by fec2 ▴ 50

0

Entering edit mode

From I checked online, the AAI analysis actually only using a representative of paralog genes (with highest identity), so the total number of genes calculated will also deduct the the paralogous gene that not included in the analysis.

ADD REPLY • link 3.2 years ago by fec2 ▴ 50