Hello,
I would like to know how to take (in FASTA format) only the sequences displayed as results of a BLAST, avoiding to search the desired sequences in the whole subjects' reports.
Hello,
I would like to know how to take (in FASTA format) only the sequences displayed as results of a BLAST, avoiding to search the desired sequences in the whole subjects' reports.
Here is a solution that will run Blast and give you a fasta file of all the sequences that were hits.
See also these Bioinformatics exercises (step 6).
Here is my entire run (should be very reproducible):
# An example database
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.faa
# Prepare dababase for Blast search
formatdb -i AE004437.faa -p T -o
# Generate an example query
awk --posix -v RS='>' '/WWW/ { print ">" $0;}' AE004437.faa \
> blast_query_2seqs.fasta
# Run blast with tabular output (extract the hit only)
blastall -p blastp -i blast_query_2seqs.fasta -d AE004437.faa -m 8 -e 1e-5 | \
awk '{print $2}' | sort | uniq \
> ids_of_4hits
# Get the fasta file for hits
fastacmd -d AE004437.faa -i ids_of_4hits > hits.fasta
Just replace:
blast_query_2seqs.fasta
with the sequences to be used as the query.AE004437.faa
with your search database-e 1e-5
The found sequences will be in hits.fasta
OK, just for fun I'll use XSLT to answer this question. As a test, using the NCBI web interface, I blast-ed gi|27592135|gb|CB017399.1 vs est_others/GenBank non-mouse and non-human EST entries and I downloaded the result as XML:
<?xml version="1.0"?>
<BlastOutput>
(...)
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|27592135|gb|CB017399.1|</Iteration_query-ID>
<Iteration_query-def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gal
lus gallus cDNA clone pgn1c.pk016.a18 5&apos; similar to ref|XP_176823.1 similar
to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_
193795.1| similar to Rotavirus X associated non-structural protein (RoXaN) [Mus
musculus], mRNA sequence</Iteration_query-def>
<Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|27592135|gb|CB017399.1|</Hit_id>
<Hit_def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus c
DNA clone pgn1c.pk016.a18 5&apos; similar to ref|XP_176823.1 similar to Rotaviru
s X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_193795.1| si
milar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus], m
RNA sequence</Hit_def>
<Hit_accession>CB017399</Hit_accession>
<Hit_len>671</Hit_len>
<Hit_hsps>
<Hsp>
The following XSLT stylesheet download the hits and print the 10 first hits as FASTA:
<xsl:stylesheet xmlns:xsl="<a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
version='1.0'
>
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="//Hit_id">
<xsl:if test="position() &lt; 11">
<xsl:variable name="gi" select="substring-before(substring-after(.,'|'),'|')"/>
<xsl:variable name="url" select="concat('<a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="nucleotide&amp;rettype=fasta&amp;retmode=xml&amp;id=',$gi)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=xml&id=',$gi)"/>
<xsl:message>Downloading <xsl:value-of select="$url"/></xsl:message>
<xsl:apply-templates select="document($url,/TSeqSet)" mode="fasta"/>
</xsl:if>
</xsl:for-each>
</xsl:template>
<xsl:template match="TSeqSet" mode="fasta">
<xsl:for-each select="TSeq">
<xsl:text>></xsl:text>
<xsl:value-of select="TSeq_defline"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="TSeq_sequence"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Usage:
xsltproc --novalid stylesheet.xsl blast.xml
Result:
>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus cDNA clone
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTCACGGGTCG...
>AGENCOURT_112364831 NIH_MGC_429 Rattus norvegicus cDNA clone IMAGE:9028078 5',
TTGAACTTATAGACAGTCGGTCCGGATTCCCGGGACAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGT...
>UI-R-FJ0-cpv-h-02-0-UI.r1 UI-R-FJ0 Rattus norvegicus cDNA clone UI-R-FJ0-cpv-h-
CCTCAAGTACTGCAGCGCCAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGTGATGTCCAAGGCCAAGA...
>LIB38529_023_H02_T7_2 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...
>LIB38529_023_H02_T7_1 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...
Hard to read ?! :-) It's only 30 lines long. THIS is hard to read: http://docbook.sourceforge.net/release/xsl/current/fo/titlepage.templates.xsl
4500 lines and 99% begin with "xsl". Made me read the 30 line style. It downloads the sequences from NCBI even though the information would be available on-site (in your search database). Not all biological sequences will be in NCBI. If the gi is not there then this approach will fail.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Can you clarify the question? My understanding is that you want to retrieve, in FASTA format, those sequences which are hits (subjects) in a BLAST report?
I want to obtain as FASTAs only the parts displayed in the alignment in the results' page. Sorry for the messy question, hope it is more clear now.