Extracting the <Hit_def> from Blast xml output using Biopython and saving in .csv
10.0 years ago
Anushka ▴ 20

Hello,

I have BLAST output in .xml format and I want to retrieve a few attributes such as <Hit_def>. I found this parser in Biopython.

CODE:

from Bio.Blast import NCBIXML

# Python 3: print() is a function and the 'rU' open mode no longer exists
with open('output.xml') as handle:
    for record in NCBIXML.parse(handle):
        for align in record.alignments:
            for hsp in align.hsps:
                print(hsp.score, align.hit_def)

Q1: The code above only prints the output to the terminal. Could anyone show me how to save the output to a .csv file?

Specifically, I need output.csv with these attributes as columns: <Iteration_query-def>, <Hit_def>, <Hsp_score>, <Hsp_evalue>.

Q2: How can I get the result for only the best hit of each query? Would setting -max_target_seqs to 1 when running blastp do the same thing?

The following is a segment of my input XML:

      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>comp552019_c3_seq6_V2</Iteration_query-def>
      <Iteration_query-len>227</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|148727288|ref|NP_002327.2|</Hit_id>
          <Hit_def>low-density lipoprotein receptor-related protein 6 precursor [Homo sapiens] &gt;gi|578822872|ref|XP_006719141.1| PREDICTED: low-density lipoprotein receptor-related protein 6 isoform X1 [Homo sapiens]</Hit_def>
          <Hit_accession>NP_002327</Hit_accession>
          <Hit_len>1613</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>43.5133894476967</Hsp_bit-score>
              <Hsp_score>101</Hsp_score>
              <Hsp_evalue>0.000198686946331968</Hsp_evalue>
              <Hsp_query-from>43</Hsp_query-from>
              <Hsp_query-to>223</Hsp_query-to>
              <Hsp_hit-from>589</Hsp_hit-from>
              <Hsp_hit-to>767</Hsp_hit-to>
              <Hsp_query-frame>0</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>53</Hsp_identity>
              <Hsp_positive>79</Hsp_positive>
              <Hsp_gaps>24</Hsp_gaps>
              <Hsp_align-len>192</Hsp_align-len>
              <Hsp_qseq>TNEC--HDSKCEHICLARDAGGFVCKCSPGFTLVSGYK-CVSDSVTDDYILVADLGQKRLFQLPIRKST-----RNVGDLVAIDLDDVTDDRIYAASVIKKTGGLAWFDISAREIV--WGSKRLSRDDAVLSITTGCCNKKVYWTTQTGIYSWDGVSSTPDKLYSVSFFSDA-QIRQVVVDCKANLLYWIEY</Hsp_qseq>
              <Hsp_hseq>SNPCAEENGGCSHLCLYRPQG-LRCACPIGFELISDMKTCI---VPEAFLLFSRRADIRRISLETNNNNVAIPLTGVKEASALDFD-VTDNRIYWTDISLKTISRAFMNGSALEHVVEFGL------DYPEGMAVDWLGKNLYW-ADTGTNRIE-VSKLDGQHRQVLVWKDLDSPRALALDPAEGFMYWTEW</Hsp_hseq>
              <Hsp_midline>+N C   +  C H+CL R  G   C C  GF L+S  K C+   V + ++L +     R   L    +        V +  A+D D VTD+RIY   +  KT   A+ + SA E V  +G       D    +      K +YW   TG    + VS    +   V  + D    R + +D     +YW E+</Hsp_midline>
            </Hsp>
            <Hsp>
              <Hsp_num>2</Hsp_num>
              <Hsp_bit-score>39.6613936885231</Hsp_bit-score>
              <Hsp_score>91</Hsp_score>
              <Hsp_evalue>0.00402563881724524</Hsp_evalue>
              <Hsp_query-from>44</Hsp_query-from>
              <Hsp_query-to>128</Hsp_query-to>
              <Hsp_hit-from>891</Hsp_hit-from>
              <Hsp_hit-to>980</Hsp_hit-to>
              <Hsp_query-frame>0</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>34</Hsp_identity>
              <Hsp_positive>43</Hsp_positive>
              <Hsp_gaps>15</Hsp_gaps>
              <Hsp_align-len>95</Hsp_align-len>
              <Hsp_qseq>NECHDSK--CEHICLARDAGGFVCKCSPGFTLVSGYKCVSDSVTDDYI--------LVADLGQKRLFQLPIRKSTRNVGDLVAIDLDDVTDDRIY</Hsp_qseq>
              <Hsp_hseq>NECASSNGHCSHLCLAVPVGGFVCGCPAHYSLNADNRTCSAPTTFLLFSQKSAINRMVIDEQQSPDIILPIH-SLRNV---RAIDYDPL-DKQLY</Hsp_hseq>
              <Hsp_midline>NEC  S   C H+CLA   GGFVC C   ++L +  +  S   T            +V D  Q     LPI  S RNV    AID D + D ++Y</Hsp_midline>
            </Hsp>
          </Hit_hsps>

I would really appreciate your help.

Thanks

blast python bioython blastp • 5.6k views

Using xsltproc rather than Python would be straightforward.

10.0 years ago
Ram 44k

You could write the output to a CSV file using file I/O: open a file in write mode and change the print statement so that it writes to that file instead. Here is one of many resources.

Google away for more. This link should help you get the attributes you require.
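
As a minimal sketch of that approach (not the only way to do it), the loop from the question can feed Python's csv module instead of print. The function names hsp_rows, write_csv and blast_xml_to_csv are made up for this example. It assumes the Biopython record attributes record.query, align.hit_def, hsp.score and hsp.expect correspond to <Iteration_query-def>, <Hit_def>, <Hsp_score> and <Hsp_evalue>, and that your Biopython version still ships Bio.Blast.NCBIXML.

```python
import csv


def hsp_rows(records):
    """Yield one (query, hit_def, score, evalue) tuple per HSP."""
    for record in records:
        for align in record.alignments:
            for hsp in align.hsps:
                yield (record.query, align.hit_def, hsp.score, hsp.expect)


def write_csv(rows, out_handle):
    """Write a header line followed by the rows to an open text handle."""
    writer = csv.writer(out_handle)
    writer.writerow(["Iteration_query-def", "Hit_def", "Hsp_score", "Hsp_evalue"])
    writer.writerows(rows)


def blast_xml_to_csv(xml_path, csv_path):
    """Hypothetical helper: parse a BLAST XML file and save it as CSV."""
    # Imported here so the helpers above work without Biopython installed.
    from Bio.Blast import NCBIXML

    with open(xml_path) as xml_in, open(csv_path, "w", newline="") as csv_out:
        write_csv(hsp_rows(NCBIXML.parse(xml_in)), csv_out)
```

Calling blast_xml_to_csv('output.xml', 'output.csv') should then give one line per HSP. The csv module quotes fields that contain commas, which matters here because <Hit_def> descriptions often do.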

Q2: Best hit is an ambiguous term. Each hit can have multiple HSPs and you'd need to average or sum across HSP scores to find the "best" alignment.
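
To make that ambiguity concrete, here is a rough sketch in which the definition of "best" is a function you pass in. The names best_hit, max_bits and total_bits are invented for this example, and it assumes each HSP object exposes the bit score as hsp.bits, as Biopython's NCBIXML records do. Ranking by the single strongest HSP and ranking by the summed HSP scores can pick different hits, so choose the one that matches your question.

```python
def max_bits(align):
    """Score a hit by its single strongest HSP (bit score)."""
    return max(hsp.bits for hsp in align.hsps)


def total_bits(align):
    """Score a hit by the sum of the bit scores of all its HSPs."""
    return sum(hsp.bits for hsp in align.hsps)


def best_hit(record, hit_score=max_bits):
    """Return the alignment with the highest hit_score, or None if no hits."""
    if not record.alignments:
        return None
    return max(record.alignments, key=hit_score)
```

Inside the parsing loop you would call best_hit(record) once per query and write out only that alignment, or pass hit_score=total_bits to rank hits by summed HSP scores instead.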
