To extract a subpart of a large XML file, the idea is to use an XML pull parser. Read and echo each node of your XML until you find a "Hit" element. Then parse, but only echo the blocks of XML elements you need.
In Java, a pull parser specific to a given DTD can be generated with xjc.
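For instance, a minimal sketch of the same idea in Python, using the standard library's xml.etree.ElementTree.iterparse (the file name and the choice of fields are just placeholders): stream through the file and, each time a Hit element closes, pull out only the pieces you need and throw the rest away so memory stays flat.

```python
import xml.etree.ElementTree as ET

def stream_hits(xml_path):
    """Yield (query, hit description, best e-value) without loading the whole file."""
    query = None
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "Iteration_query-def":
            query = elem.text
        elif elem.tag == "Hit":
            hit_def = elem.findtext("Hit_def")
            evalues = [float(e.text) for e in elem.iter("Hsp_evalue")]
            yield query, hit_def, min(evalues) if evalues else None
            elem.clear()  # free the Hit subtree we have just consumed

for query, hit, evalue in stream_hits("blast_results.xml"):
    print(query, hit, evalue, sep="\t")
```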
You could compress the file using something like parallel bzip2, then read from the file as a compressed I/O pipe. Obviously this only reduces the physical size of the file, not its contents, and you should expect a longer running time too, I imagine.
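As a rough sketch of the compressed-pipe idea (the file name is a placeholder): Python's built-in bz2 module can read the multi-stream files that pbzip2 produces (Python 3.3+), and the resulting handle can be fed straight to an XML parser.

```python
import bz2
import xml.etree.ElementTree as ET

# Read the bzip2-compressed XML as a stream; nothing is decompressed to disk.
with bz2.open("blast_results.xml.bz2", "rb") as handle:
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Hit":
            print(elem.findtext("Hit_def"))
            elem.clear()
```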
A solution for the future would perhaps be to bunch queries into multiple files, so that you will have multiple smaller XML outputs, which you could easily process using Biopython, and perhaps even in parallel, if you use a cluster.
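If you go that route, something along these lines could pull the top hits out of a set of per-batch XML outputs with Biopython's NCBIXML parser and a small worker pool; only a sketch, and the chunk_*.xml pattern, the pool size, and the top_hits helper are assumptions rather than part of any standard recipe.

```python
import glob
from multiprocessing import Pool
from Bio.Blast import NCBIXML

def top_hits(xml_file, n=5):
    """Return (query, hit title, best e-value) for the first n hits in one XML output."""
    rows = []
    with open(xml_file) as handle:
        for record in NCBIXML.parse(handle):
            for alignment in record.alignments[:n]:
                rows.append((record.query, alignment.title, alignment.hsps[0].expect))
    return rows

if __name__ == "__main__":
    files = glob.glob("chunk_*.xml")   # one BLAST XML output per query batch
    with Pool(4) as pool:
        for rows in pool.map(top_hits, files):
            for query, title, evalue in rows:
                print(query, title, evalue, sep="\t")
```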
The file size isn't so much of an issue; it's the time it takes to iterate through the files that's annoying.
The whole reason I'm trying to do this is to get the full description of the hit. My solution for now has been to parse the human-readable default BLAST output directly into a tab-delimited format.
Have you had any problems parsing the tabbed output? I am more 'comfortable' with XML, since I am confident that it will be correctly and reliably parsed (XML is for parsing anyways, right?).
It would be more sensible to generate a smaller XML file in the first instance, using BLAST command-line parameters. But perhaps you did not generate the original file?
Currently Biopython parses BLAST XML, but doesn't write it out again. That would be the most elegant Biopython-based solution to your needs.
Otherwise you'll need to write something specific, possibly using one of the built-in Python XML libraries, or hand-code it for this specific need.
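For a file that still fits in memory, a hand-rolled version can be as small as this sketch with the built-in xml.etree.ElementTree (the function and file names are placeholders): read the tree, drop everything past the first few hits per query, and write it back out.

```python
import xml.etree.ElementTree as ET

def keep_top_hits(in_xml, out_xml, n=5):
    """Rewrite a BLAST XML report keeping only the first n hits per query."""
    tree = ET.parse(in_xml)            # loads the whole file, so it needs enough RAM
    for hits in tree.getroot().iter("Iteration_hits"):
        for extra in hits.findall("Hit")[n:]:
            hits.remove(extra)         # BLAST lists hits best-first, so this keeps the top n
    tree.write(out_xml, xml_declaration=True, encoding="UTF-8")

keep_top_hits("blast_results.xml", "blast_top_hits.xml")
```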
Yes, next time I'll reduce the number of reported HSPs. I was trying to be smart by keeping everything and didn't think parsing would be a problem later.
If you are interested in parsing some information out of the big XML file, you can use the event-parsing strategy; lxml in Python has a great event parser. Please drop a line after this comment if you are interested in this approach and I will post a detailed methodology.
@Bio_Neo: Could you describe this in more detail? I have a BLAST result in XML format with the top 50 hits for a large multi-FASTA. I need to reduce this to the top 5 hits for use in some downstream processes. Could you explain how to use the event-parsing strategy for this?
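One way to do it with lxml's event parser, as a minimal sketch (the function name, file names and keep count are placeholders): stream over each Iteration as it finishes parsing, remove everything past the top N Hit elements, and write the trimmed tree back out. Because the dropped hits are removed as you go, what stays in memory is only the reduced report, not the full 50-hit version.

```python
from lxml import etree

def reduce_hits(in_xml, out_xml, keep=5):
    """Trim a BLAST XML report to the top `keep` hits per query."""
    tree = None
    # Stream over each <Iteration> (one per query sequence) as it finishes parsing.
    for _, iteration in etree.iterparse(in_xml, events=("end",), tag="Iteration"):
        tree = iteration.getroottree()
        hits = iteration.find("Iteration_hits")
        if hits is not None:
            for extra in hits.findall("Hit")[keep:]:
                hits.remove(extra)     # hits are ordered best-first, so drop the tail
    # Only the trimmed iterations remain attached to the tree at this point.
    tree.write(out_xml, xml_declaration=True, encoding="UTF-8")

reduce_hits("blast_top50.xml", "blast_top5.xml")
```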