Question

Extract The 100 Bases Downstream Of A Blast-Aligned Sequence

12.8 years ago

User 8159 ▴ 30

Hi all,

I need to get the 100 nucleotides downstream of each aligned sequence in the results of a web BLAST query.

Is there any way to get the required sequence from NCBI?

blast sequence data genomics • 3.7k views

updated 12.8 years ago by Pierre Lindenbaum 164k • written 12.8 years ago by User 8159 ▴ 30

score 2 · Answer 1 · 2012-02-29

The alignments listed in the results of your web BLAST search should have start/end coordinates. Query Start/End are the coordinates of the aligned part of your input sequence; Subject Start/End are the coordinates of the aligned part of the sequence it matched.

If you click on one of the results, it should take you to the information page for that hit. Using the selection box in the top-left corner, you can switch the page to FASTA format, which gives you the sequence of the hit.

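Using the Subject Start/End coordinates, you can then pick out the downstream region you want from that sequence: the 100 bases that follow Subject End (if Subject Start is greater than Subject End, the hit is on the minus strand and the downstream bases lie before Subject End instead).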
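As a rough illustration of that coordinate arithmetic, here is a minimal Python sketch. The function name `downstream` and the `flank` parameter are made up for this example, and it assumes you already have the subject sequence as a plain string and the BLAST coordinates as 1-based, inclusive integers.

```python
# Complement table for reverse-complementing minus-strand results.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def downstream(subject_seq, hit_from, hit_to, flank=100):
    """Return up to `flank` bases downstream of an aligned region.

    Coordinates are 1-based and inclusive, as in a BLAST report.
    hit_from > hit_to is taken to mean a minus-strand alignment.
    """
    if hit_from <= hit_to:
        # Plus strand: the downstream bases directly follow hit_to.
        return subject_seq[hit_to:hit_to + flank]
    # Minus strand: the downstream bases sit just before hit_to on the
    # plus strand, so slice them out and reverse-complement.
    start = max(0, hit_to - 1 - flank)
    region = subject_seq[start:hit_to - 1]
    return region.translate(COMPLEMENT)[::-1]
```

If the alignment ends close to the end of the subject sequence, the function simply returns fewer than `flank` bases (possibly an empty string).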

score 1 · Answer 2 · 2012-02-29

Say your result was downloaded as an XML file:

<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.26+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), &quot;A greedy algorithm for aligning DNA sequences&quot;, J Comput Biol 2000;
 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>56043</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>210</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>56043</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>210</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|321173691|gb|HM035539.1|</Hit_id>
  <Hit_def>Rotavirus A Hu/R1949/FRA/2008 non-structural protein 4 gene, partial cds</Hit_def>
  <Hit_accession>HM035539</Hit_accession>
  <Hit_len>659</Hit_len>

The following XSLT stylesheet extracts the gi of every hit, downloads the nucleotide sequence using efetch, and prints the context +/- 100 bp for each Hsp:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    version='1.0'
    >


<xsl:output method="text" encoding="UTF-8"/>

<xsl:template match="/">

<xsl:apply-templates select="//Hit"/>
</xsl:template>

<xsl:template match="Hit">
<xsl:variable name="def" select="Hit_def"/>
<xsl:variable name="gi" select="substring-before(substring-after(Hit_id,'gi|'),'|')"/>
<xsl:variable name="url" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&amp;rettype=fasta&amp;retmode=xml&amp;id=',$gi)"/>

<xsl:variable name="fasta" select="document($url)/TSeqSet/TSeq/TSeq_sequence"/>

<xsl:for-each select="Hit_hsps/Hsp">
<xsl:variable name="hstart" select="number(Hsp_hit-from)"/>
<xsl:variable name="hend" select="number(Hsp_hit-to)"/>
<xsl:variable name="start">
    <xsl:choose>
        <xsl:when test="$hstart &lt; $hend and $hstart &lt; 100">
            <xsl:value-of select="number(0)"/>
        </xsl:when>
        <xsl:when test="$hstart &lt; $hend">
            <xsl:value-of select="$hstart - 100"/>
        </xsl:when>
        <xsl:when test="$hstart &gt; $hend and $hend &lt; 100">
            <xsl:value-of select="number(0)"/>
        </xsl:when>
        <xsl:otherwise>
            <!-- minus-strand hit: mirror the plus-strand case -->
            <xsl:value-of select="$hend - 100"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:variable>
<xsl:variable name="end">
    <xsl:choose>
        <xsl:when test="$hstart &lt; $hend">
            <xsl:value-of select="$hend"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$hstart"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:variable>

<xsl:variable name="seq" select="substring($fasta, $start, 1 + $end - $start)"/>

<xsl:text>></xsl:text>
<xsl:value-of select="concat($def,'|',$hstart,'-',$hend )"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="$seq"/>
<xsl:text>
</xsl:text>

</xsl:for-each>

</xsl:template>

</xsl:stylesheet>

Example:

xsltproc --novalid stylesheet.xsl blast.xml

>Rotavirus A Hu/R1949/FRA/2008 non-structural protein 4 gene, partial cds|351-560
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Rotavirus A Hu/R1778/FRA/2008 non-structural protein 4 gene, partial cds|351-560
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A isolate 6361 NSP4 (NSP4) gene, complete cds|394-603
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Rotavirus A strain RVA/Human-wt/USA2007719739/2007/G1P[8] segment 10 NSP4 (NSP4) gene, complete cds
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A strain mani-476/08 non-structural protein 4 (NSP4) gene, complete cds|364-573
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A strain mani-365/07 non-structural protein 4 (NSP4) gene, complete cds|372-581
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A strain mani-63/06 non-structural protein 4 (NSP4) gene, complete cds|359-568
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A strain CMH032/05 nonstructural protein NSP4 (NSP4) gene, complete cds|394-603
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAAATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A strain CMH015/05 nonstructural protein NSP4 (NSP4) gene, complete cds|394-603
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Rotavirus A Hu/Dhaka6/BGD/2001/G11P25 NSP4 gene, complete cds|394-603
TATAAAGAACAAGTTACTACTAAAGACGAAATTGAACAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
>Human rotavirus A NSP4 gene for enterotoxin, complete cds, isolate: BSGH 8|379-588
TATAAAGAACAAGTTACCACTAAAGACGAAATTGAGCAACAGATGGATAGAATTGTAAAAGAGATGAGACGTCAGCTGGAGATGATTGATAAATTAACTA
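For readers who would rather not use XSLT, the same idea can be sketched in Python: read the hit coordinates of each Hsp from the BLAST XML and build an efetch URL that asks only for the downstream window. This is a sketch under assumptions, not a tested pipeline. The function name `downstream_requests` is made up, `SAMPLE` is a trimmed-down illustration of the report above (the 351-560 coordinates are taken from the first header in the example output), and it relies on efetch's `seq_start`, `seq_stop` and `strand` parameters as described in the E-utilities documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Trimmed illustration of a BLAST XML report (not a complete file).
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit>
  <Hit_id>gi|321173691|gb|HM035539.1|</Hit_id>
  <Hit_accession>HM035539</Hit_accession>
  <Hit_len>659</Hit_len>
  <Hit_hsps><Hsp>
    <Hsp_hit-from>351</Hsp_hit-from>
    <Hsp_hit-to>560</Hsp_hit-to>
  </Hsp></Hit_hsps>
</Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""


def downstream_requests(blast_xml, flank=100):
    """Yield (accession, start, stop, url) for each Hsp's downstream window."""
    root = ET.fromstring(blast_xml)
    for hit in root.iter("Hit"):
        acc = hit.findtext("Hit_accession")
        hit_len = int(hit.findtext("Hit_len"))
        for hsp in hit.iter("Hsp"):
            h_from = int(hsp.findtext("Hsp_hit-from"))
            h_to = int(hsp.findtext("Hsp_hit-to"))
            if h_from <= h_to:
                # Plus strand: window starts right after the alignment.
                start, stop, strand = h_to + 1, min(h_to + flank, hit_len), 1
            else:
                # Minus strand: window lies before h_to on the plus strand.
                start, stop, strand = max(1, h_to - flank), h_to - 1, 2
            if start > stop:
                continue  # alignment reaches the end of the subject
            query = urlencode({"db": "nucleotide", "id": acc,
                               "rettype": "fasta", "seq_start": start,
                               "seq_stop": stop, "strand": strand})
            yield acc, start, stop, f"{EFETCH}?{query}"


for acc, start, stop, url in downstream_requests(SAMPLE):
    print(acc, start, stop, url)
```

For a real report you would pass in the contents of the downloaded XML file instead of `SAMPLE`, then fetch each URL (for example with `urllib.request.urlopen`), keeping NCBI's request-rate limits in mind. Note that for this hit the subject is only 659 bases long, so the window after position 560 is 99 bases rather than 100.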