Forum: Large Tables in Supplementary PDF in Journal Articles
7 • dario.garvan ▴ 520 • 11.4 years ago

What are your opinions on this? When a table is embedded in a PDF, it can't be correctly pasted into a spreadsheet program, especially if it spans multiple pages. This makes computational operations on the table impossible. It would be easy for journals to have rules requiring that tables be provided as CSV or XLS.

publication • 7.5k views

Comment: Moving to "Forum"

5 • Michael 55k • 11.4 years ago

I agree that this is annoying and obfuscating; whether it is intentional or not, I do not know. I do not fully understand how someone with even the most minimal computing skills could have such an amazingly stupid idea as to store computational data in PDF. PDF is meant to make documents look identical on any screen and printer, not to store data. Was it the journals allowing only PDF, or the authors, assuming scientists are at least halfway intelligent people (or possibly not)?

Anyway, it is possible to break the obfuscation, at least sometimes. Recently I wanted to extract data from this supplementary file: http://www.nature.com/nbt/journal/v23/n8/extref/nbt1118-S4.pdf

To get the text I used pdf2txt.py: pdf2txt.py -o data/nbt.txt -W 10000 -M 1000 -L 1000 nbt1118-S4.pdf (I experimented a bit with the options). The output still looks odd because the column headers come out as one letter per row, the table is fragmented, it contains special characters, and so on.

Using the following Perl script, I was able to retrieve a clean tab-separated table. I was not able to retrieve the metadata given by an "X" in certain columns, though. I think it is possible (one would have to count the whitespace), but I didn't have time to test it.

#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
  # Example of an input line:
  # 16846 AAGUAUAAAAGUUUAGUGUtc X X                     0.761
  chomp;
  # Capture the ID, the sequence and the trailing score; skip over the "X" flag columns.
  my @ar = m/\s*(\d+)\s+(\w+)[X ]+\s*(\d\.\d+)/;
  if (@ar) {
    print join("\t", @ar), "\n";
  }
}
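
A minimal way to run this script (assuming it is saved as extract_table.pl and the pdf2txt.py output is in data/nbt.txt; both file names are just placeholders) is: perl extract_table.pl data/nbt.txt > table.tsv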

Comment: Please note that this is not a suggestion for a solution, quite the opposite: it is totally messy and is meant to show the absurdity of such attempts. There is furthermore no guarantee, with any of the approaches mentioned, that the extracted data are completely correct. What I mean to say is that you can probably drive a nail into the wall with forceps instead of a hammer, but the fact that you can doesn't mean you should.

4 • sarahhunter ▴ 600 • 11.4 years ago

You might find some of the work done by Manchester University on extracting information from PDFs interesting, particularly Utopia documents: see "Calling International Rescue: knowledge lost in literature and data landslide" by Attwood et al.

4 • Ben ★ 2.0k • 11.4 years ago

There are some online tools that can do this pretty well; for example, Nitro Cloud seems to correctly convert everything from the PDF example in Michael's post, and Zamzar seems to get everything except the column headers. They split the PDF pages across sheets in Excel, but I guess you could then write each sheet to CSV and concatenate; not an ideal solution, but good enough for the odd occasion you have to do this sort of thing.

I expect authors do this because of author guidelines or submission systems that ask for supplementary data as PDF; they aren't using the data themselves in PDF form, so they never run into its limitations.

Comment: The conversions are in fact really good, but there are downsides: Nitro Cloud allows only 5 free conversions without registration, and with an account it is limited to 5 per month (or pay, or have some spare email addresses ;). As you say, each page's table is rendered in its own sheet, so exporting them manually means 49 "save as CSV" clicks. I think a VBA script that first joins all sheets into a single one would be a bit more comfortable (and easy too, if you know VBA). Maybe like this?
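
For instance, here is a rough sketch in Python with pandas instead of VBA (file names like "converted.xls" and "combined.csv" are placeholders, and pandas plus an Excel reader engine are assumed to be installed):

import pandas as pd

# sheet_name=None loads every sheet of the workbook into a dict of DataFrames
sheets = pd.read_excel("converted.xls", sheet_name=None, header=None)

# Stack the sheets in order and write them out as a single CSV
combined = pd.concat(sheets.values(), ignore_index=True)
combined.to_csv("combined.csv", index=False, header=False)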

Comment: I tried the macro, and it worked with the Zamzar-converted XLS, after:

  • saving it as a "macro-enabled Excel workbook" to allow macros
  • removing the additional text rows (header and footnotes) on sheets 1 and 49
  • removing an extra empty column B on sheet 48 left over from the conversion, which shifts all columns to the left

After that, I had a single sheet with 2384 rows plus a header.

I think we have taken a great step forward towards reproducible science </irony>

2 • Neilfws 49k • 11.4 years ago

There are solutions, as outlined in some of the answers, but do not expect them to work well (if at all) in all cases.

What you need to understand is that historically, the role of journals has not been to provide data in a usable form. The PDF is simply a hangover from the days of print. Journals assume that all most readers want to do is print out material to take away and read. The idea that people might want to mine published material is rather strange to many publishers and indeed some would actively seek to prevent it.

I can only suggest we all lobby journals to adopt more modern policies regarding provision of data.

Comment: I think you are correct about the historical reasons, but in addition there seems to be a lack of awareness of reproducible science. In fact, the fragility and complexity of all the methods demonstrated here ridicule the journals' approach to handling supplementary data and provide the best argument for why one would not want to do it this way; we even found the need for OCR software to read sequences from image files.

Comment: I agree this appears to be a leftover from the premodern days of Charles Dickens, but I don't think it's purposefully obfuscatory (although it can have that effect when you can't...get...at...the...data). Most manuscript submission websites seem to convert your material to PDF, and there it stays. There also appears to be a disconnect between the editor/associate editor (who would understand our issues) and the people who come in later to do the proofs (who are not scientists). The whole concept of "supplementary data" has grown silly anyway; that's usually where the main work you need access to ends up. A larger change in publishing format is clearly needed.

1 • Max • 11.4 years ago

PDF is a format for graphics, or maybe for English text. People should not use it for tables or for any data that they want to be reusable.

For the special case of tables in PDFs, this problem is annoying and common enough (e.g. with government data) that someone wrote a dedicated converter just for tables in PDFs: https://github.com/jazzido/tabula. I have never tried it, but it includes special code to identify rows and remove headers and page breaks, so it really should work better.
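
A rough sketch of what that could look like via tabula-py, a Python wrapper around Tabula (it needs Java installed; "supplement.pdf" is just a placeholder file name, and I have not tried this on the file above):

import tabula

# Let Tabula detect every table on every page and return them as DataFrames
tables = tabula.read_pdf("supplement.pdf", pages="all", multiple_tables=True)

# Or dump all detected tables straight into a single CSV file
tabula.convert_into("supplement.pdf", "supplement.csv", output_format="csv", pages="all")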

Mary: If authors report motifs as graphics in a PDF, then the only motivation I can see is that they don't want their data to be used, or they forgot to provide it. You should email Matthieu Blanchette and ask for the raw data, which he definitely has. He is most likely aware of the problem. (If he doesn't reply: one of my colleagues works for him.)

As a general-purpose solution, I have had very good results with OCR software like OmniPage or ABBYY. It often produces good XML, or at least HTML, from PDFs that for some reason fail with pdftotext. You can also give the Java-based pdfbox a try, or Python's pyPdf or pdfMiner mentioned above; in my hands there is not a lot of difference between these tools.
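
For the pdfMiner route, the maintained pdfminer.six fork has a one-call high-level API; a minimal sketch (with "supplement.pdf" again a placeholder) is:

from pdfminer.high_level import extract_text

# Extract all text from the PDF in reading order, as a single string
text = extract_text("supplement.pdf")
print(text[:500])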

If you want to write something yourself, which I don't recommend, you need one of these PDF-extraction libraries. They give you access to each individual character on every page and let you find out its font size, font type, position, etc. CERMINE is supposedly a good tool for this, but I haven't tried it; see http://sciencesoft.web.cern.ch/node/120
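
As an illustration of that character-level access, a small pdfminer.six sketch (not CERMINE; a rough example only) that prints each character with its font and position:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page_layout in extract_pages("supplement.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for char in line:
                if isinstance(char, LTChar):
                    # font name, font size and bounding box of every character
                    print(char.get_text(), char.fontname, round(char.size, 1), char.bbox)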

For anyone working in text mining, PDFs are a time-consuming obstacle, but they are the de facto standard for scientific text. Tools like Papers or Google Scholar's parsers have to use various rules to extract the author names, title and abstract from a PDF: they go for the biggest font on the first page (title), non-English text underneath it (authors), and perhaps a single paragraph of indented or bold text after that (abstract). Another technique is to look for a DOI, which is easily recognizable with a regular expression, and then look up the metadata in CrossRef.
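
The DOI trick is simple enough to sketch in a few lines of Python (the regular expression is a commonly used pattern rather than an exhaustive one, and the CrossRef REST endpoint is queried anonymously):

import json
import re
from urllib.request import urlopen

DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def doi_metadata(text):
    """Find the first DOI in the text and fetch its metadata from CrossRef."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    doi = match.group(0).rstrip(".,;")  # trim trailing punctuation
    with urlopen("https://api.crossref.org/works/" + doi) as response:
        return json.load(response)["message"]  # title, author list, journal, ...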

Comment: There is a PDF metadata standard, described here: How does Mekentosj Papers work?

Comment: Hi Max! I doubt they are trying to be hard to work with, and I'd definitely contact them if I really need these data later. But I just thought that this paper, which I happened to be reading, was a great test of some of these strategies. Maybe there's something I don't understand about this format, though. Have a look at the supplement and tell me if there's something I'm missing about it.

0 • Mary 11k • 11.4 years ago

Arrgggh...I had my first opportunity to try some of these converter tools out. I was hoping to get out those motifs in supplementary table 7. There's a lot of 'em. But it looks like this supplement is made of images. I tried Zamzar and got 64 tabs of nothing.

http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2684.html

I loved the paper anyway. But that was a bear.

Comment: Maybe some OCR software could... but no, this is just absurd.

0 • 11.4 years ago

I think this is a problem that existing tools can start to tackle. Adobe has largely given up on Flash, and I think it is in their best interest to work on making PDFs more open as well.

http://tv.adobe.com/watch/accessibility-adobe/acrobat-tagging-pdf-content-as-a-table/
