Question

Forum: How Does Mekentosj Papers Work?

1

13.2 years ago

Jeremy Leipzig 23k

How is the Papers program (http://www.mekentosj.com/papers) able to import a PDF of an article and know the details of fields like Author, Title, and Journal? Are these fields encoded in some metadata within the PDF itself, or is it looking them up on the web? If it is a web lookup, how does it do that?

papers • 3.5k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 13.2 years ago by Jeremy Leipzig 23k

score 0 · Answer 1 · 2012-05-21

0

13.2 years ago

Istvan Albert 103k

Wouldn't that be some sort of trade secret?

I could imagine several heuristics that would greatly facilitate the process. For example, comparing the title/body to a known database of papers, whether leased from ISI, collected from properly formatted sources like the publisher, or just looked up in the existing papers of all users. Add to that an attempt to parse the paper itself as a fallback.
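To make the database-comparison heuristic concrete, here is a minimal sketch of fuzzy title matching using Python's standard `difflib`. Everything in it is an assumption for illustration: `KNOWN_PAPERS` is a made-up stand-in for whatever reference database a tool might have, its entries are placeholders, and the `0.85` cutoff is arbitrary. Nothing here is known to reflect how Papers actually works.

```python
import difflib

# Hypothetical reference database; the entries are placeholders, not real papers.
KNOWN_PAPERS = {
    "A Hypothetical Study of Sequence Alignment": {
        "authors": ["A. Author", "B. Author"],
        "journal": "Journal of Placeholder Results",
    },
    "Placeholder Methods for Variant Calling": {
        "authors": ["C. Author"],
        "journal": "Example Letters",
    },
}


def normalize(text):
    """Lowercase and collapse whitespace so layout noise does not affect matching."""
    return " ".join(text.lower().split())


def lookup_by_title(candidate, database=KNOWN_PAPERS, cutoff=0.85):
    """Return (title, record) for the closest known title, or None if nothing is close."""
    index = {normalize(title): title for title in database}
    matches = difflib.get_close_matches(
        normalize(candidate), list(index), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    title = index[matches[0]]
    return title, database[title]
```

A title scraped from the first page of a PDF, even with a typo or odd spacing, could be passed to `lookup_by_title` to pull the clean record. A real system would need a far larger index and a faster matching strategy than a linear scan.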

ADD COMMENT • link 13.2 years ago by Istvan Albert 103k

0

Well, I would hope to take advantage of the mechanism so I could produce PDFs of my analyses with relevant metadata that Papers could understand.

ADD REPLY • link 13.2 years ago by Jeremy Leipzig 23k

score 0 · Answer 2 · 2012-05-28

Hi, there are actually multiple ways to extract document details from PDF documents. First and foremost, you can extract raw text from PDFs using a number of tools and libraries. Moreover, there are standard formats for storing metadata along with documents. In the case of PDF articles, one widely used format is XMP. PDFs that have XMP embedded properly are very easy to process, since this is a standard format and the metadata can be extracted with little effort.

I'm not certain how Papers does it, but I can tell you what Mendeley does, at least in a broad sense. When you import a paper into Mendeley Desktop, the application attempts to read the full text of the PDF and, using some fancy data mining techniques, extract the document details (title, abstract, authors, etc.). Obviously, if the PDF is recent and has standard XMP, this is rather straightforward and quick. For older papers and those that are not standards-compliant, this is a bit more complicated. There are also failsafe measures that help keep metadata clean in Mendeley, which involve taking identifiers from the full text and performing online queries to verify that the metadata is correct and complete.
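As a rough illustration of the XMP route, the sketch below scans the raw bytes of a PDF for an XMP packet and reads the Dublin Core title and creator fields using only the Python standard library. It assumes the packet is stored uncompressed and wrapped in an `x:xmpmeta` element, which is common but not guaranteed. The function name `read_xmp` is made up for this example, and this is not a description of what Mendeley or Papers actually do.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core, as used inside XMP packets.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Assumption: the packet sits in the file as plain, uncompressed XML.
XMP_PACKET = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def _items(root, dc_tag):
    """Collect the text of every rdf:li nested under a given Dublin Core element."""
    values = []
    for element in root.iter(f"{{{DC}}}{dc_tag}"):
        for li in element.iter(f"{{{RDF}}}li"):
            if li.text and li.text.strip():
                values.append(li.text.strip())
    return values


def read_xmp(pdf_bytes):
    """Return {'title': ..., 'authors': [...]} from an embedded XMP packet, or None."""
    match = XMP_PACKET.search(pdf_bytes)
    if match is None:
        return None  # no visible packet; a tool would fall back to text mining
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        return None  # packet found but not well-formed XML
    titles = _items(root, "title")
    return {"title": titles[0] if titles else None, "authors": _items(root, "creator")}
```

Calling `read_xmp(open("report.pdf", "rb").read())` on a file of your own (the file name is a placeholder) returns `None` when no readable packet is found, which is the point at which a tool would have to fall back to parsing the text.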

As you can see, there is much that can be done and a lot of the libraries are openly and freely available. However, the implementation can in fact be a "trade secret", as Istvan said. :)
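For the identifier step, one plausible starting point is to pull DOI-like strings out of the extracted text before querying anything online. The sketch below is only that first half: the pattern is a common approximation of DOI syntax rather than a full validator, the function name `find_dois` is invented for this example, and the online lookup itself is left out.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return unique DOI-like strings found in text, in order of first appearance."""
    seen = []
    for raw in DOI_PATTERN.findall(text):
        # Drop sentence punctuation that tends to cling to the end of a DOI.
        # Assumption: a real DOI rarely ends in one of these characters.
        doi = raw.rstrip(".,;:)]}").lower()  # DOIs are case-insensitive
        if doi not in seen:
            seen.append(doi)
    return seen
```

Each string returned by `find_dois` could then be sent to a lookup service, and the record that comes back compared against whatever was mined from the PDF.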

Short disclaimer: I'm a happy Mendeley user since late 2008 and also currently a community liaison at Mendeley.