How to merge BLAST XML output from different database?
1
0
Entering edit mode
7.6 years ago
tcf.hcdg ▴ 70

Hello I did the command line BLAST against NCBI nr database. Because of the limited computational resourses, administrator at our university reccomended me to split the query sequence as well as databases (120gb of nr). I splited the database in almost two equall parts and then did the blastx against each part. Now I have the blast results of each query sequence in tow parts. Each query have the maximum of 20 hits in each part of the database. I would like to use BLAST2GO for further annotation. But BLAST2GO can accept the multiple files for only one part of the database.. Therefore, I would like to merge the results of each query sequence from both part of the database. I could merge the tabulat output of blast and but with XML files it is not working. Could you please guide me how can I merge BLAST XML results from different databases.

BLASTXML merge blast • 4.1k views
ADD COMMENT
0
Entering edit mode

Now I have the blast results of each query sequence in tow parts

does that mean that the very same query is present in the two xml files ?

ADD REPLY
0
Entering edit mode

Yes I use the same query sequence for both part of database.

ADD REPLY
0
Entering edit mode

For example "TRINITY_DN99193_c0_g5" query against database1


http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>BLASTX 2.6.0+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/scratch/tb44227/db/nrdb_1/nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>TRINITY_DN99193_c0_g5_i1 len=580</Iteration_query-def>
  <Iteration_query-len>580</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|598008123|ref|XP_007370486.1|</Hit_id>
  <Hit_def>hypothetical protein DICSQDRAFT_93145, partial [Dichomitus squalens LYAD-421 SS1] >gi|395324364|gb|EJF56806.1| hypothetical protein DICSQDRAFT_93145, partial [Dichomitus squalens LYAD-421 SS1]</Hit_def>

. . .

AND "TRINITY_DN99193_c0_g5" query against database2

<BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>TRINITY_DN99193_c0_g5_i1 len=580</Iteration_query-def>
  <Iteration_query-len>580</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|973327510|gb|KUL87405.1|</Hit_id>
  <Hit_def>hypothetical protein ZTR_04698 [Talaromyces verruculosus]</Hit_def>
  <Hit_accession>KUL87405</Hit_accession>
  <Hit_len>171</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>131.724</Hsp_bit-score>
      <Hsp_score>330</Hsp_score>
      <Hsp_evalue>1.36501e-35</Hsp_evalue>
      <Hsp_query-from>134</Hsp_query-from>
      <Hsp_query-to>580</Hsp_query-to>
      <Hsp_hit-from>22</Hsp_hit-from>
      <Hsp_hit-to>170</Hsp_hit-to>
      <Hsp_query-frame>-1</Hsp_query-frame>
      <Hsp_hit-frame>0</Hsp_hit-frame>

. . .

ADD REPLY
0
Entering edit mode

Is there only ONE query def / Iteration per XML file ?

ADD REPLY
0
Entering edit mode

No, ecah file has 50000 query sequences, am I have 62 files 31 for each.

ADD REPLY
0
Entering edit mode
7.6 years ago

Ok, I've quickly written something: https://github.com/lindenb/jvarkit/wiki/MergeBlastXml

$ java -jar dist/mergeblastxml.jar input1.blastn.xml  input2.blastn.xml  input2.blastn.xml > out.xml
ADD COMMENT
0
Entering edit mode
java -jar dist/mergeblastxml.jar ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > output.xml

[INFO][MergeBlastXml]resolveEntity:-//NCBI//NCBI BlastOutput/EN/http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd/null
[INFO][MergeBlastXml]opening /home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2
[INFO][MergeBlastXml]resolveEntity:-//NCBI//NCBI BlastOutput/EN/http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd/null
[SEVERE][MergeBlastXml]ParseError at [row,col]:[840759,6]
Message: The processing instruction target matching "[xX][mM][lL]" is not allowed.
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[840759,6]
Message: The processing instruction target matching "[xX][mM][lL]" is not allowed.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.peek(XMLEventReaderImpl.java:276)
        at com.github.lindenb.jvarkit.tools.blast.MergeBlastXml.doWork(MergeBlastXml.java:210)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:747)
        at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:861)
        at com.github.lindenb.jvarkit.tools.blast.MergeBlastXml.main(MergeBlastXml.java:273)
ADD REPLY
0
Entering edit mode

what is the output of

xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 > /dev/null
xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > /dev/null

?

I suspect your process have concatenated some XML files

ADD REPLY
0
Entering edit mode
I am sorry I couldn't respond earlier because of my limited posting option. 
Well, Itried and the output is like that:
tb44227@lido-gw02:~/nobackup/jvarkit>xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1 > /dev/null
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1:822600: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
     ^
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp1:822601: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm
^
tb44227@lido-gw02:~/nobackup/jvarkit>
tb44227@lido-gw02:~/nobackup/jvarkit>xmllint ~/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2 > /dev/null
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2:840759: parser error : XML declaration allowed only at the start of the document
<?xml version="1.0"?>
     ^
/home/tb44227/nobackup/Radula/Trinity_1500k_Radula_lnr_hededt_split/merging_xml_db1_db2/03_Rma_trn1500k_nrbx5_hsp20_e3_dbp2:840760: parser error : Extra content at the end of the document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm
^
ADD REPLY
0
Entering edit mode

I think the script should be like:

First it should merge all the resulted xml files that was generated for one database 1 let's say in this case merge all 31 files of database1

Then it should some how sort the xml files. The reason to sort is queries are not insorted format,because I used GNU parrallel to split the blast queries over nodes.

Doing the same exercise for the database2.

Now we will have two files that contained the results for each query against each database.

Therefore script should be trained in such a way that it takes the first query from first file and copy the results of this specific query from both files

    BUT BEFORE Pasting the results into new file, a little bit modification is required.
    Each query has a maximum of 20 hits. it means the hit starts from 
        <Hit_num>1</Hit_num>
        .   
        .
        .
        .and ends on 
        <Hit_num>20</Hit_num>

    The problem is the hit number always starts form 1 to 20 in both databases.

I wonder whether it will effect when I use this file for the BLAST2GO? or we need to change the number of second part of th edatabase from

        <Hit_num>1</Hit_num>     TO  <Hit_num>21</Hit_num>
        .   
        .
        .
        .and  
        <Hit_num>20</Hit_num>    TO <Hit_num>40</Hit_num>

    So, After merging each query will have 40 hits
ADD REPLY
0
Entering edit mode

these are not valid xml files.

ADD REPLY
0
Entering edit mode

As I mentioned in the previous comment, because fo the GNU parallel each file has multiple blast xml header and footer.

I remove the repeated header and footer from the file and then tried your script, it worked and merged two files. But the resultant file in in one line

http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"><BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>BLASTX 2.6.0+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/scratch/tb44227/db/nrdb_1/nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>TRINITY_DN99193_c0_g5_i1 len=580</BlastOutput_query-def>
  <BlastOutput_query-len>580</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations><Iteration><Iteration_iter-num>1459</Iteration_iter-num><Iteration_query-ID>Query_1459</Iteration_query-ID><Iteration_query-def>TRINITY_DN100341_c0_g1_i1 len=1240</Iteration_query-def><Iteration_query-len>1240</Iteration_query-len><Iteration_hits></Iteration_hits><Iteration_stat><Statistics><Statistics_db-num>112583993</Statistics_db-num><Statistics_db-len>41258651338</Statistics_db-len><Statistics_hsp-len>156</Statistics_hsp-len><Statistics_eff-space>6089755946510</Statistics_eff-space><Statistics_kappa>0.041</Statistics_kappa><Statistics_lambda>0.267</Statistics_lambda><Statistics_entropy>0.14</Statistics_entropy></Statistics></Iteration_stat><Iteration_message>No hits found</Iteration_message></Iteration><Iteration><Iteration_iter-num>1375</Iteration_iter-num><Iteration_query-ID>Query_1375</Iteration_query-ID><Iteration_query-def>TRINITY_DN100342_c0_g1_i1 len=3033</Iteration_query-def><Iteration_query-len>3033</Iteration_query-len><Iteration_hits><Hit><Hit_num>1</Hit_num><Hit_id>gi|168037340|ref|XP_001771162.1|</Hit_id><Hit_def>predicted protein [Physcomitrella patens] >gi|162677542|gb|EDQ64011.1| predicted protein [Physcomitrella patens]</Hit_def><Hit_accession>XP_001771162</Hit_accession><Hit_len>1226</Hit_len><Hit_hsps><Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>58.5362</Hsp_bit-score><Hsp_score>140</Hsp_score><Hsp_evalue>7.71095e-05</Hsp_evalue><Hsp_query-from>2883</Hsp_query-from><Hsp_query-to>2996</Hsp_query-to><Hsp_hit-from>573</Hsp_hit-from><Hsp_hit-to>611</Hsp_hit-to><Hsp_query-frame>3</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>27</Hsp_identity><Hsp_positive>30</Hsp_positive><Hsp_gaps>1</Hsp_gaps><Hsp_align-len>39</Hsp_align-len><Hsp_qseq>RSYKFILTCYLERYINRWNPFTAKAMEVSHGREA-LVPA</Hsp_qseq><Hsp_hseq>RSYNFIVTCGLERYINIWNPFTAKTMGYLHGHEASVMPA</Hsp_hseq><Hsp_midline>RSY FI+TC LERYIN WNPFTAK M   HG EA ++PA</Hsp_midline></Hsp></Hit_hsps></Hit><Hit><Hit_num>1</Hit_num><Hit_id>gi|1026767740|gb|OAE27996.1|</Hit_id><Hit_def>hypothetical protein AXG93_1062s1280 [Marchantia polymorpha subsp. polymorpha]</Hit_def><Hit_accession>OAE27996</Hit_accession><Hit_len>883</Hit_len><Hit_hsps><Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>43.8986</Hsp_bit-score><Hsp_score>102</Hsp_score><Hsp_evalue>6.4385e-09</Hsp_evalue><Hsp_query-from>2228</Hsp_query-from><Hsp_query-to>2305</Hsp_query-to><Hsp_hit-from>321</Hsp_hit-from><Hsp_hit-to>346</Hsp_hit-to><Hsp_query-frame>2</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>17</Hsp_identity><Hsp_positive>23</Hsp_positive><Hsp_gaps>0</Hsp_gaps><Hsp_align-len>26</Hsp_align-len><Hsp_qseq>ILNAPLCATWLRLVENDKLVYGYDLG</Hsp_qseq><Hsp_hseq>LVNAPMCATWLRVAESDKLLYGDDRG</Hsp_hseq><Hsp_midline>++NAP+CATWLR+ E+DKL+YG D G</Hsp_midline></Hsp><Hsp><Hsp_num>2</Hsp_num><Hsp_bit-score>39.2762</Hsp_bit-score><Hsp_score>90</Hsp_score><Hsp_evalue>6.4385e-09</Hsp_evalue><Hsp_query-from>2472</Hsp_query-from><Hsp_query-to>2627</Hsp_query-to><Hsp_hit-from>421</Hsp_hit-from><Hsp_hit-to>460</Hsp_hit-to><Hsp_query-frame>3</Hsp_query-frame><Hsp_hit-frame>0</Hsp_hit-frame><Hsp_identity>23</Hsp_identity><Hsp_positive>28</Hsp_positive><Hsp_gaps>12</Hsp_gaps><Hsp_align-len>52</Hsp_align-len><Hsp_qseq>KWAAIEHAHGIFGICGCEYCRWMFSLHRSMAKLLHCHYQRLGNRWTLDPFVA</Hsp_qseq><Hsp_hseq>KWTAKEH............
ADD REPLY

Login before adding your answer.

Traffic: 2989 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6