If I understand this correctly you want to perform an NCBI BLAST+ blastp search against a database provided by NCBI (as Michael says most likely 'nr': the non-identical protein sequence database produced by NCBI) and from the BLAST result you want the hit alignments with their statistics (score, E-value, %identity and %similarity), as they appear in the "Alignments" section of the web result.
So working backwards...
The "Alignments" section in the on-line result is almost a direct reflection of the default NCBI BLAST+ or Legacy NCBI BLAST output, with a few bits of HTML added. You can see the plain result as produced by the program by going to "Formatting options" and disabling the HTML output. NCBI BLAST+ can produce a number of other formats; see "16. The BLAST Sequence Analysis Tool" in "The NCBI Handbook" and "BLAST Command Line Applications User Manual" in "BLAST Help".
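One of those other formats is the tab-separated table (-outfmt 6), which is easier to handle in a program than the default report when all you need is the per-hit statistics. The following is only a rough sketch: it assumes the search was run with -outfmt "6 qseqid sseqid pident ppos evalue bitscore" (so exactly six columns, in that order), and the names Hit, parse_tabular and SAMPLE are made up for illustration. The rows in SAMPLE are invented, not real BLAST output.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of tabular output, assuming the column order
    qseqid sseqid pident ppos evalue bitscore (an assumption of this sketch)."""
    query_id: str
    subject_id: str
    pct_identity: float   # pident: percentage of identical matches
    pct_positives: float  # ppos: percentage of positive-scoring matches
    evalue: float
    bit_score: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse six-column tabular BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' comment lines written by -outfmt 7.
        if not line or line.startswith("#"):
            continue
        query_id, subject_id, pident, ppos, evalue, bits = line.split("\t")
        hits.append(
            Hit(query_id, subject_id, float(pident), float(ppos),
                float(evalue), float(bits))
        )
    return hits


# Invented example rows for demonstration only, NOT real BLAST results.
SAMPLE = (
    "query1\tsubjA\t98.50\t99.00\t1e-50\t200\n"
    "query1\tsubjB\t45.20\t61.30\t0.002\t38.5\n"
)

if __name__ == "__main__":
    for hit in parse_tabular(SAMPLE):
        print(hit.subject_id, hit.pct_identity, hit.pct_positives,
              hit.evalue, hit.bit_score)
```

If you ask for a different set of columns, the field list in Hit and the unpacking in parse_tabular have to be changed to match.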
On to the NCBI BLAST databases: these are available from the NCBI's FTP site, and descriptions of the available databases can be found in the "BLAST FTP Site" documentation and the "blastftp.txt" file on the FTP site. There are dependencies between some of the databases; for example, the 'swissprot' database requires the 'nr' database, because it is implemented as a subset of 'nr' using BLAST mask files. So be sure to download all of the required files. The 'update_blastdb.pl' script mentioned by Michael can be used to mirror the NCBI produced BLAST databases. While you can create these databases using 'makeblastdb' and the fasta format files provided by NCBI, the resulting databases will be missing additional information (e.g. Taxonomy) present in the pre-formatted databases, since these are generated from the ASN.1, not the fasta format.
The command you quote will use a multiple sequence alignment (MSA) as the query for a PSI-BLAST search. From your description I suspect this is not what you intended. Instead I think you wanted to perform a blastp (protein sequence vs. protein sequence database) search, which would be something like:
blastp -query seq.tfa -db 'nr'
Depending on what you are doing, you may find it more convenient to use the web services to access NCBI's BLAST services remotely rather than trying to maintain the BLAST databases locally. You can do this with the NCBI BLAST+ binaries by using the -remote option, for example:
blastp -query seq.tfa -db 'nr' -remote
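If you end up calling blastp from a script rather than typing it by hand, the two command lines above differ only in the -remote flag, so a small wrapper can cover both. This is just a sketch under some assumptions: the blastp binary from NCBI BLAST+ is on your PATH, Python 3.7 or later is available, and the function names build_blastp_command and run_blastp are hypothetical helpers, not part of any library.

```python
import subprocess
from typing import List, Optional


def build_blastp_command(query: str, db: str = "nr",
                         remote: bool = False,
                         outfmt: Optional[str] = None) -> List[str]:
    """Build the blastp argument list without running anything."""
    cmd = ["blastp", "-query", query, "-db", db]
    if remote:
        # Run the search on NCBI's servers instead of a local database.
        cmd.append("-remote")
    if outfmt is not None:
        cmd += ["-outfmt", outfmt]
    return cmd


def run_blastp(query: str, db: str = "nr", remote: bool = False,
               outfmt: Optional[str] = None) -> str:
    """Run blastp and return its standard output as text.

    Assumes blastp is installed and on PATH; raises
    subprocess.CalledProcessError if blastp exits with an error.
    """
    cmd = build_blastp_command(query, db, remote, outfmt)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command lines; nothing is executed here.
    print(" ".join(build_blastp_command("seq.tfa")))
    print(" ".join(build_blastp_command("seq.tfa", remote=True)))
```

Passing the arguments as a list avoids shell quoting problems with file names. Remote searches can take a while, so for anything more than occasional use the APIs described below are probably a better fit.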
To develop programs which use the NCBI BLAST services, see "BLAST Developer Information" for details of the available REST and SOAP APIs. These APIs are supported by many of the bioinformatics code libraries (e.g. BioPerl, BioPython, .NET Bio, etc.) so see their documentation for details. Depending on your database requirements you may also want to look at other organisations which provide NCBI BLAST based services, see BioCatalogue.org for a selection of web services providing NCBI BLAST searches. For example EMBL-EBI provide REST and SOAP APIs for their NCBI BLAST service (used by UniProt.org to power their BLAST search).
If you want to derive a multiple sequence alignment (MSA) from the NCBI BLAST blastp output, then you may want to look at tools such as DbClustal and MView (see http://www.ebi.ac.uk/Tools/msa/ for services and pointers to documentation and downloads).
- Update for additional information relating to the original question
Alternatively, if you want to perform a multiple pairwise sequence alignment (multiple PSA) for a set of sequences, you can use the pairwise sequence alignment (PSA) functionality in the NCBI BLAST+ programs (in "Legacy" NCBI BLAST this used the 'bl2seq' program instead), thus:
blastp -query querySeqs.tfa -subject targetSeqs.tfa
This gives a set of pairwise alignments for each sequence in querySeqs.tfa vs. each sequence in targetSeqs.tfa. Note that when performing the alignments this way, the statistics are based on the pair of sequences being aligned rather than on the query and database. For details of some other methods to consider using when performing multiple pairwise sequence alignments, see this post.
I think it's most likely NR, but you need to understand which database is appropriate and decide that for yourself. And no, you don't need FASTA files, you need the processed database. Read the readme file in ftp.ncbi.nih.gov/blast/db/ first, and see my answer below.
"and I need the database." Which one? NR? We cannot know which database you want to use, can we?
you are right. I want to do a protein sequence alignment using Blastp. I know that for sequence alignment I need to get a FASTA from (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/). Do you know which one is the default one used by pBlast people on the online version?