What tools are useful for text mining of PDF-based literature? For example, suppose I had a list of several genes and several phenotypes, and wanted to look for associations between those genes and phenotypes in literature for which a PDF is available but the HTML full text is not. Are there tools that do this type of search efficiently?
I have installed, but never used, Xapers, which can index PDF files and other sources. I don't know whether you are looking for something fancy and machine-learning based, or whether simple indexing and searching is good enough for your purposes.
There is also pdfgrep, which could be nice for quickly searching a few PDFs.
pdftotext is the best solution I have tried so far.
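In case it helps, here is a rough sketch of how pdftotext output could feed the gene/phenotype search from the question. It assumes the pdftotext command-line tool (from poppler-utils) is on your PATH; the function names pdf_to_text and cooccurrences are my own, the sentence splitting is deliberately naive, and matching two terms in the same sentence is only a crude proxy for an association, not evidence of one.

```python
import re
import subprocess
from itertools import product


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling the pdftotext CLI.

    Assumes pdftotext (poppler-utils) is installed; passing "-" as the
    output file makes it write the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def cooccurrences(text, genes, phenotypes):
    """Return (gene, phenotype, sentence) triples for sentences naming both.

    Gene symbols are matched case-sensitively as whole words; phenotype
    terms are matched case-insensitively as substrings.
    """
    # Collapse line breaks and runs of spaces left over from the PDF layout.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        for gene, phenotype in product(genes, phenotypes):
            gene_found = re.search(r"\b" + re.escape(gene) + r"\b", sentence)
            pheno_found = re.search(re.escape(phenotype), sentence, re.IGNORECASE)
            if gene_found and pheno_found:
                hits.append((gene, phenotype, sentence))
    return hits
```

To use it you would call something like cooccurrences(pdf_to_text("paper.pdf"), genes, phenotypes) for each file, where "paper.pdf" is a placeholder path. Expect false positives from the crude sentence splitting (abbreviations such as "e.g." break sentences early) and misses when the extracted text is garbled by multi-column layouts.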
Not a direct answer to your request, but perhaps this resource might be of use: EVEX (I am not sure how well maintained it still is, though).