7.0 years ago
Biologist123
Hi folks,
I have been trying to work out how to compare two or more BLAST output files with Biopython and either (a) remove hits where the output sequence ID is the same between files (i.e. hits that are common to multiple BLAST searches); or (b) keep only those hits that are common between the files. Has anyone got any advice on how to script this?
Thanks. :)
If you don't mind not using Python, a lot of this hard work has already been done for you in command-line tools like diff. You'd need to sort the files equivalently first, but after that it's easy to output just the relevant lines.
Even better, I use icdiff (https://github.com/jeffkaufman/icdiff), which colourises the output in an intelligent way.
If you want to start doing comparisons based on numerical fields though (e.g. keep all lines that are different, but only those with an E-value < 0.1), then I would go with Sej's suggestion.
Just to throw the cat amongst the pigeons too, you can do similar manipulations with Python's csv module, which is in the standard library, if you're just interested in string comparisons for example.
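As a rough sketch of the csv approach (not a definitive implementation), something like the following could cover both (a) and (b) from the question. It assumes the BLAST results were saved as tab-separated text (e.g. -outfmt 6) and that the ID to compare is the subject ID in the second column; change id_column if you mean the query ID instead. The function names read_hits and split_hits are made up for this example.

```python
import csv


def read_hits(path, id_column=1):
    """Map each sequence ID to the list of rows that mention it.

    Assumes a tab-separated BLAST table; id_column=1 is the subject ID
    in the default -outfmt 6 column order (an assumption, adjust as needed).
    """
    hits = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and the comment lines written by -outfmt 7.
            if not row or row[0].startswith("#"):
                continue
            hits.setdefault(row[id_column], []).append(row)
    return hits


def split_hits(paths, id_column=1):
    """Return (shared_ids, common, unique) for a list of BLAST tables.

    shared_ids: IDs present in every file.
    common:     one dict per file, keeping only hits to shared IDs.
    unique:     one dict per file, with hits to shared IDs removed.
    """
    per_file = [read_hits(path, id_column) for path in paths]
    shared_ids = set.intersection(*(set(hits) for hits in per_file))
    common = [
        {seq_id: rows for seq_id, rows in hits.items() if seq_id in shared_ids}
        for hits in per_file
    ]
    unique = [
        {seq_id: rows for seq_id, rows in hits.items() if seq_id not in shared_ids}
        for hits in per_file
    ]
    return shared_ids, common, unique
```

Calling split_hits(["search1.tsv", "search2.tsv"]) (placeholder file names) gives you the shared IDs plus the filtered hits for each file, which you can write back out with csv.writer. Note that "common" here means present in every file; if you want IDs shared by any two files, the set logic would need changing.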