Question

Taking Only Aligned Sequences In A Blast

7

Entering edit mode

14.2 years ago

Anima Mundi ★ 2.9k

Hello,

I would like to know how to take (in FASTA format) only the sequences displayed as results of a BLAST, avoiding to search the desired sequences in the whole subjects' reports.

blast fasta parsing • 15k views

ADD COMMENT • link updated 14.2 years ago by Pierre Lindenbaum 166k • written 14.2 years ago by Anima Mundi ★ 2.9k

1

Entering edit mode

Can you clarify the question? My understanding is that you want to retrieve, in FASTA format, those sequences which are hits (subjects) in a BLAST report?

ADD REPLY • link 14.2 years ago by Neilfws 49k

0

Entering edit mode

I want to obtain as FASTAs only the parts displayed in the alignment in the results' page. Sorry for the messy question, hope it is more clear now.

ADD REPLY • link 14.2 years ago by Anima Mundi ★ 2.9k

Ram · Answer 1 · 2011-02-19

Here is a solution that will run Blast and give you a fasta file of all the sequences that were hits.

See also these Bioinformatics exercises (step 6).

Here is my entire run (should be very reproducible):

# An example database
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.faa

# Prepare dababase for Blast search  
formatdb -i AE004437.faa -p T -o

# Generate an example query
awk --posix -v RS='>' '/WWW/ { print ">" $0;}' AE004437.faa \
  > blast_query_2seqs.fasta

# Run blast with tabular output (extract the hit only)
blastall -p blastp -i blast_query_2seqs.fasta -d AE004437.faa -m 8 -e 1e-5 | \
  awk '{print $2}' | sort | uniq \
  > ids_of_4hits

# Get the fasta file for hits
fastacmd -d AE004437.faa -i ids_of_4hits > hits.fasta

Just replace:

blast_query_2seqs.fasta with the sequences to be used as the query.
AE004437.faa with your search database
You may choose a different threshold instead of -e 1e-5

The found sequences will be in hits.fasta

Ram · Answer 2 · 2011-02-19

OK, just for fun I'll use XSLT to answer this question. As a test, using the NCBI web interface, I blast-ed gi|27592135|gb|CB017399.1 vs est_others/GenBank non-mouse and non-human EST entries and I downloaded the result as XML:

<?xml version="1.0"?>
<BlastOutput>
   (...)
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>gi|27592135|gb|CB017399.1|</Iteration_query-ID>
  <Iteration_query-def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gal
lus gallus cDNA clone pgn1c.pk016.a18 5&amp;apos; similar to ref|XP_176823.1 similar
 to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_
193795.1| similar to Rotavirus X associated non-structural protein (RoXaN) [Mus
musculus], mRNA sequence</Iteration_query-def>
  <Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|27592135|gb|CB017399.1|</Hit_id>
  <Hit_def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus c
DNA clone pgn1c.pk016.a18 5&amp;apos; similar to ref|XP_176823.1 similar to Rotaviru
s X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_193795.1| si
milar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus], m
RNA sequence</Hit_def>
  <Hit_accession>CB017399</Hit_accession>
  <Hit_len>671</Hit_len>
  <Hit_hsps>
    <Hsp>

The following XSLT stylesheet download the hits and print the 10 first hits as FASTA:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
   version='1.0'
   >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="//Hit_id">
<xsl:if test="position() &amp;lt; 11">
<xsl:variable name="gi" select="substring-before(substring-after(.,'|'),'|')"/>
<xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="nucleotide&amp;amp;rettype=fasta&amp;amp;retmode=xml&amp;amp;id=',$gi)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=xml&id=',$gi)"/>
<xsl:message>Downloading <xsl:value-of select="$url"/></xsl:message>
<xsl:apply-templates select="document($url,/TSeqSet)" mode="fasta"/>
</xsl:if>
</xsl:for-each>
</xsl:template>

<xsl:template match="TSeqSet" mode="fasta">
<xsl:for-each select="TSeq">
<xsl:text>></xsl:text>
<xsl:value-of select="TSeq_defline"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="TSeq_sequence"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

Usage:

xsltproc --novalid stylesheet.xsl blast.xml

Result:

>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus cDNA clone
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTCACGGGTCG...
>AGENCOURT_112364831 NIH_MGC_429 Rattus norvegicus cDNA clone IMAGE:9028078 5',
TTGAACTTATAGACAGTCGGTCCGGATTCCCGGGACAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGT...
>UI-R-FJ0-cpv-h-02-0-UI.r1 UI-R-FJ0 Rattus norvegicus cDNA clone UI-R-FJ0-cpv-h-
CCTCAAGTACTGCAGCGCCAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGTGATGTCCAAGGCCAAGA...
>LIB38529_023_H02_T7_2 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...
>LIB38529_023_H02_T7_1 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...