Question

Text mining in PDFs

0

6.2 years ago

Matt LaFave ▴ 310

What tools are useful for text mining PDF-based literature? For example, suppose I had a list of several genes and several phenotypes and wanted to look for associations between those genes and phenotypes in literature for which a PDF is available but the full-text HTML is not. Are there tools that do this type of search efficiently?

text mining • 1.3k views

updated 6.2 years ago by h.mon 35k • written 6.2 years ago by Matt LaFave ▴ 310

1

pdftotext is the best solution I have tried so far.

6.2 years ago by btsui ▴ 300
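
To make the pdftotext suggestion concrete for the gene/phenotype case in the question, here is a minimal sketch, not a tested pipeline. It assumes the `pdftotext` command-line tool (from poppler-utils or Xpdf) is installed and on the PATH. The function names `pdf_to_text` and `cooccurrences` are hypothetical, and "association" is approximated very crudely as a gene and a phenotype appearing in the same sentence.

```python
import re
import subprocess
from itertools import product


def pdf_to_text(pdf_path):
    """Return the text of a PDF by calling the pdftotext CLI."""
    # "-" as the output file tells pdftotext to write to stdout;
    # -layout asks it to keep the original physical layout where it can.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def mentions(term, sentence):
    """Case-insensitive whole-word match, so TP53 does not hit TP53BP1."""
    return re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE) is not None


def cooccurrences(text, genes, phenotypes):
    """Yield (gene, phenotype, sentence) for sentences that mention both."""
    # Collapse whitespace so PDF line breaks do not split multi-word terms.
    flat = re.sub(r"\s+", " ", text)
    # Naive sentence split on ., ! or ?; abbreviations will confuse it.
    for sentence in re.split(r"(?<=[.!?])\s+", flat):
        for gene, phenotype in product(genes, phenotypes):
            if mentions(gene, sentence) and mentions(phenotype, sentence):
                yield gene, phenotype, sentence.strip()
```

Usage would look something like `list(cooccurrences(pdf_to_text("paper.pdf"), genes, phenotypes))`, where `paper.pdf` is a placeholder path. Same-sentence co-occurrence is only a first filter; hits still need to be read, and hyphenation or multi-column layouts in the PDF can hide real matches.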

0

Not a direct answer to your request, but perhaps this resource might be of use: EVEX (I'm not sure how well maintained it still is, though).

6.2 years ago by lieven.sterck 15k

score 4 · Answer 1 · 2018-09-27

6.2 years ago

Joe 21k

Something like https://github.com/kermitt2/grobid perhaps?

There are lots of PDF mining repos on GitHub that I’d suggest having a mooch through.

I’ve definitely come across libraries for extracting data from graphs too, but can’t for the life of me find the repos now...

score 2 · Answer 2 · 2018-09-28

I have installed - but never used - Xapers, which can index PDF files and other sources. I don't know if you are looking for some fancy machine-learning approach, or if simple indexing and searching is good enough for your purposes.

There is also pdfgrep, which could be nice for quickly searching a few PDFs.
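
For the quick-search case, a rough sketch of driving pdfgrep from a script might look like the following. It assumes `pdfgrep` is installed and on the PATH; the helper names `pdfgrep_command` and `search_pdfs` are made up for illustration. Each term is searched as a fixed string so that gene names containing regex metacharacters are taken literally.

```python
import subprocess


def pdfgrep_command(term, pdf_paths):
    """Build a pdfgrep call that searches for one literal term, ignoring case."""
    # -i: ignore case, -n: print page numbers, -H: print file names,
    # -F: treat the pattern as a fixed string instead of a regex.
    # The term is the positional PATTERN argument, followed by the files.
    return ["pdfgrep", "-i", "-n", "-H", "-F", term, *pdf_paths]


def search_pdfs(terms, pdf_paths):
    """Run pdfgrep once per term and return {term: [matching output lines]}."""
    hits = {}
    for term in terms:
        # Like grep, pdfgrep exits with status 1 when nothing matches,
        # so check=True is deliberately left out here.
        proc = subprocess.run(
            pdfgrep_command(term, pdf_paths), capture_output=True, text=True
        )
        hits[term] = proc.stdout.splitlines()
    return hits
```

Calling `search_pdfs(["TP53", "BRCA1"], ["paper1.pdf", "paper2.pdf"])` (placeholder file names) would give the matching lines per term, with file names and page numbers, which is usually enough to decide which papers are worth opening.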