Taking Only Aligned Sequences In A Blast
2
7
Entering edit mode
13.9 years ago
Anima Mundi ★ 2.9k

Hello,

I would like to know how to take (in FASTA format) only the sequences displayed as results of a BLAST, avoiding to search the desired sequences in the whole subjects' reports.

blast fasta parsing • 15k views
ADD COMMENT
1
Entering edit mode

Can you clarify the question? My understanding is that you want to retrieve, in FASTA format, those sequences which are hits (subjects) in a BLAST report?

ADD REPLY
0
Entering edit mode

I want to obtain as FASTAs only the parts displayed in the alignment in the results' page. Sorry for the messy question, hope it is more clear now.

ADD REPLY
6
Entering edit mode
13.8 years ago

Here is a solution that will run Blast and give you a fasta file of all the sequences that were hits.

See also these Bioinformatics exercises (step 6).

Here is my entire run (should be very reproducible):

# An example database
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.faa

# Prepare dababase for Blast search  
formatdb -i AE004437.faa -p T -o

# Generate an example query
awk --posix -v RS='>' '/WWW/ { print ">" $0;}' AE004437.faa \
  > blast_query_2seqs.fasta

# Run blast with tabular output (extract the hit only)
blastall -p blastp -i blast_query_2seqs.fasta -d AE004437.faa -m 8 -e 1e-5 | \
  awk '{print $2}' | sort | uniq \
  > ids_of_4hits

# Get the fasta file for hits
fastacmd -d AE004437.faa -i ids_of_4hits > hits.fasta

Just replace:

  1. blast_query_2seqs.fasta with the sequences to be used as the query.
  2. AE004437.faa with your search database
  3. You may choose a different threshold instead of -e 1e-5

The found sequences will be in hits.fasta

ADD COMMENT
5
Entering edit mode
13.8 years ago

OK, just for fun I'll use XSLT to answer this question. As a test, using the NCBI web interface, I blast-ed gi|27592135|gb|CB017399.1 vs est_others/GenBank non-mouse and non-human EST entries and I downloaded the result as XML:

<?xml version="1.0"?>
<BlastOutput>
   (...)
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>gi|27592135|gb|CB017399.1|</Iteration_query-ID>
  <Iteration_query-def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gal
lus gallus cDNA clone pgn1c.pk016.a18 5&amp;apos; similar to ref|XP_176823.1 similar
 to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_
193795.1| similar to Rotavirus X associated non-structural protein (RoXaN) [Mus
musculus], mRNA sequence</Iteration_query-def>
  <Iteration_query-len>671</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|27592135|gb|CB017399.1|</Hit_id>
  <Hit_def>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus c
DNA clone pgn1c.pk016.a18 5&amp;apos; similar to ref|XP_176823.1 similar to Rotaviru
s X associated non-structural protein (RoXaN) [Mus musculus] ref|XP_193795.1| si
milar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus], m
RNA sequence</Hit_def>
  <Hit_accession>CB017399</Hit_accession>
  <Hit_len>671</Hit_len>
  <Hit_hsps>
    <Hsp>

The following XSLT stylesheet download the hits and print the 10 first hits as FASTA:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
   version='1.0'
   >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:for-each select="//Hit_id">
<xsl:if test="position() &amp;lt; 11">
<xsl:variable name="gi" select="substring-before(substring-after(.,'|'),'|')"/>
<xsl:variable name="url" select="concat('&lt;a href=" http:="" eutils.ncbi.nlm.nih.gov="" entrez="" eutils="" efetch.fcgi?db="nucleotide&amp;amp;rettype=fasta&amp;amp;retmode=xml&amp;amp;id=',$gi)" "="" rel="nofollow">http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=xml&id=',$gi)"/>
<xsl:message>Downloading <xsl:value-of select="$url"/></xsl:message>
<xsl:apply-templates select="document($url,/TSeqSet)" mode="fasta"/>
</xsl:if>
</xsl:for-each>
</xsl:template>

<xsl:template match="TSeqSet" mode="fasta">
<xsl:for-each select="TSeq">
<xsl:text>></xsl:text>
<xsl:value-of select="TSeq_defline"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="TSeq_sequence"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

Usage:

xsltproc --novalid stylesheet.xsl blast.xml

Result:

>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus cDNA clone
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTCACGGGTCG...
>AGENCOURT_112364831 NIH_MGC_429 Rattus norvegicus cDNA clone IMAGE:9028078 5',
TTGAACTTATAGACAGTCGGTCCGGATTCCCGGGACAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGT...
>UI-R-FJ0-cpv-h-02-0-UI.r1 UI-R-FJ0 Rattus norvegicus cDNA clone UI-R-FJ0-cpv-h-
CCTCAAGTACTGCAGCGCCAAGGCCCGGCATTGCTGGACCAAGGAGAGGCGGGTCCTTCTGGTGATGTCCAAGGCCAAGA...
>LIB38529_023_H02_T7_2 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...
>LIB38529_023_H02_T7_1 LIB38529 Canis lupus familiaris cDNA clone LIB38529_023_H
GCAGCAGACCTATGACATGTGGCTGAAGAAACACAACCCAGGGAAGCCAGGGGAAGGGACCCCGATCAGTTCTCGGGAAG...
ADD COMMENT
1
Entering edit mode

I think Pierre is the only person who uses XSLT for fun ;-)

ADD REPLY
0
Entering edit mode

+1 Nice. Looks pretty durable. It works. Verified by using blast -m 7 ... in my environment. Unfortunately, the stylesheet is hard to read. Please see my -m 8 based solution.

ADD REPLY
0
Entering edit mode

Hard to read ?! :-) It's only 30 lines long. THIS is hard to read: http://docbook.sourceforge.net/release/xsl/current/fo/titlepage.templates.xsl

ADD REPLY
0
Entering edit mode

4500 lines and 99% begin with "xsl". Made me read the 30 line style. It downloads the sequences from NCBI even though the information would be available on-site (in your search database). Not all biological sequences will be in NCBI. If the gi is not there then this approach will fail.

ADD REPLY
0
Entering edit mode

you're absolutely right ! Again, I wanted to have fun with the output of NCBI/XML/blast because xslt differs from the usual solutions :-)

ADD REPLY

Login before adding your answer.

Traffic: 2072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6