Text mining in PDFs
6.2 years ago
Matt LaFave ▴ 310

What tools are useful for text mining of PDF-based literature? For example, suppose I had a list of several genes and several phenotypes, and wanted to look for associations between those genes and phenotypes in literature for which a PDF is available but the full text in HTML is not. Are there tools that can do this type of search efficiently?

text mining • 1.3k views

pdftotext is the best solution I have tried so far.
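As a rough illustration of how pdftotext could feed the gene/phenotype search from the question, here is a minimal Python sketch. It assumes the `pdftotext` command-line tool is installed and on the PATH; the function names `pdf_to_text` and `find_cooccurrences` are made up for this example, and matching on sentence-level co-occurrence is only one simple way to define an "association".

```python
import re
import subprocess
from itertools import product


def pdf_to_text(pdf_path):
    """Return the text of a PDF by calling the pdftotext command-line tool."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_cooccurrences(text, genes, phenotypes):
    """Return (gene, phenotype, sentence) for sentences mentioning both."""
    # Collapse line breaks left over from the PDF layout, then split naively
    # on sentence-ending punctuation (an assumption; abbreviations such as
    # "et al." will split badly).
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for gene, phenotype in product(genes, phenotypes):
            # Word boundaries keep a short gene symbol from matching inside
            # a longer word.
            gene_pattern = r"\b" + re.escape(gene.lower()) + r"\b"
            if re.search(gene_pattern, lowered) and phenotype.lower() in lowered:
                hits.append((gene, phenotype, sentence))
    return hits
```

Usage would look something like `find_cooccurrences(pdf_to_text("paper.pdf"), genes, phenotypes)`, where `paper.pdf` is a placeholder file name.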


Not a direct answer to your request, but perhaps EVEX might be of use (I'm not sure how well maintained it still is, though).

6.2 years ago
Joe 21k

Something like https://github.com/kermitt2/grobid perhaps?

There are lots of PDF mining repos on github that I’d suggest having a mooch through.

I’ve definitely come across libraries for extracting data from graphs too but can’t for the life of me find the repos now...
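GROBID converts a PDF into TEI XML, so once a paper has been processed, the body text can be pulled out with the standard library and searched like any other text. The sketch below assumes you already have the TEI output as a string; `tei_paragraphs` is a hypothetical helper name, not part of GROBID.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_paragraphs(tei_xml):
    """Return the body paragraphs of a TEI document as plain strings."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    # Only look inside <body>, so header metadata is skipped.
    for body in root.iter(TEI + "body"):
        for p in body.iter(TEI + "p"):
            # itertext() flattens inline markup such as references;
            # split/join normalizes whitespace.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs
```

Each returned paragraph can then be scanned for the gene and phenotype terms of interest.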

6.2 years ago
h.mon 35k

I have installed, but never used, Xapers, which can index PDF files and other sources. I don't know whether you are looking for something fancy involving machine learning, or whether simple indexing and searching is good enough for your purposes.

There is also pdfgrep, which could be nice for quickly searching a few PDFs.
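For a quick pass over a folder of papers, pdfgrep can be driven from a small script. This is only a sketch: it assumes `pdfgrep` is installed, that the search terms contain no regular-expression metacharacters, and the function names and the `papers` directory are placeholders.

```python
import subprocess


def build_pdfgrep_command(terms, directory):
    """Build a pdfgrep call that searches a directory for any of the terms."""
    # Join the terms into one alternation pattern (term1|term2|...).
    pattern = "|".join(terms)
    # -i: ignore case, -n: print page numbers, -r: recurse into the directory.
    return ["pdfgrep", "-i", "-n", "-r", pattern, directory]


def search_pdfs(terms, directory):
    """Run pdfgrep and return its output lines (empty list if no match)."""
    # No check=True: like grep, pdfgrep exits non-zero when nothing matches.
    result = subprocess.run(
        build_pdfgrep_command(terms, directory),
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Calling `search_pdfs(["tp53", "sox10"], "papers")` would list the matching lines along with file names and page numbers.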
