Question

How Do I Download All Full-Text Papers From A DSpace Repository?

3

14.4 years ago

Lars Juhl Jensen 11k

As some of you may know, I work (among many other things) on text mining. An obvious prerequisite for text mining is having the text itself, which, because of paywalls, is not as easy as one could wish. Supposedly, a good number of papers published behind paywalls are deposited by their authors in institutional repositories, many of which run the DSpace software.

My question is thus: how do I download the entire collection of full-text papers that have been deposited in a DSpace repository?

I am very familiar with wget and curl, but I hope there is a better way to get all papers than to mirror the entire repository and subsequently sort out the mess of files and directories.

DSpace papers • 5.5k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 14.4 years ago by Lars Juhl Jensen 11k

score 6 · Answer 1 · 2010-08-21

6

14.4 years ago

Laurent Gautier ▴ 810

What about the dspace Python package, or ready-made Python executables?

1

dspace.repository.Repository can get handles or items, and should therefore provide an easy way to harvest the documents, I think.

ADD REPLY • link 14.4 years ago by Laurent Gautier ▴ 810
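For readers who want to see what harvesting handles can look like without relying on the dspace package (whose API is not shown in this thread), here is a minimal sketch. It assumes the repository exposes the standard OAI-PMH interface and that its identifiers follow the common DSpace convention `oai:<host>:<prefix>/<suffix>`; both are assumptions to verify against the target repository. The function name `handles_from_list_identifiers` is made up for illustration. It parses one `ListIdentifiers` response and returns the handles plus the resumption token for the next page, if any.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def handles_from_list_identifiers(xml_text):
    """Return (handles, resumption_token) from one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    handles = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        # Skip records the repository has flagged as deleted.
        if header.get("status") == "deleted":
            continue
        ident = header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        # Assumed identifier layout: oai:<host>:<prefix>/<suffix>
        handle = ident.strip().rsplit(":", 1)[-1]
        if "/" in handle:
            handles.append(handle)
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    # An empty or missing token means there are no further pages.
    return handles, (token.strip() or None)
```

The response text itself would come from a request such as `<base>/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc`, where the `/oai/request` path is a typical DSpace default rather than a guarantee.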

0

Thanks for the suggestions - I will look into them. It looks to me as if the dspace Python package can harvest only metadata, not the actual documents, though.

ADD REPLY • link 14.4 years ago by Lars Juhl Jensen 11k

0

Yes, that indeed looks promising. Once I have the handles it should be easy to download the papers ... I hope.

ADD REPLY • link 14.4 years ago by Lars Juhl Jensen 11k
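The step from handles to papers could be sketched roughly as follows. This assumes the classic DSpace web interface, where an item page lives at `<base>/handle/<handle>` and its files are linked through URLs containing `/bitstream/`; newer DSpace versions may use different paths, so check the target repository first. The names `item_page_url`, `bitstream_links` and `BitstreamLinkParser` are hypothetical helpers, and the sketch only extracts the file URLs from an already downloaded item page, leaving the actual fetching to wget, curl or similar.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BitstreamLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links that point at bitstreams."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: file downloads are served from paths containing /bitstream/.
        if href and "/bitstream/" in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # drop duplicate links, keep page order
                self.links.append(url)


def item_page_url(base_url, handle):
    """Build the item page URL for a handle such as '1234/56' (assumed layout)."""
    return f"{base_url.rstrip('/')}/handle/{handle}"


def bitstream_links(page_html, page_url):
    """Return the bitstream URLs found in the HTML of one item page."""
    parser = BitstreamLinkParser(page_url)
    parser.feed(page_html)
    return parser.links
```

Filtering the returned URLs on a `.pdf` suffix before downloading would skip licence files and thumbnails, which repositories often store as bitstreams alongside the paper.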

Ram · Answer 2 · 2012-04-04

0

12.8 years ago

Ankur ▴ 20

You can use the EUtils SOAP web services from NCBI. EFetch would do the trick for you; give the database parameter as PubMed.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 12.8 years ago by Ankur ▴ 20
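To make the EFetch suggestion concrete, here is a small sketch that builds an EFetch request. It uses the plain URL interface to E-utilities rather than SOAP, and the helper name `efetch_url` and the example PMIDs are made up for illustration. Note that with `db=pubmed`, EFetch returns citation records and abstracts, so this would not by itself deliver the full text of a paper.

```python
from urllib.parse import urlencode

# Base URL of the EFetch utility in NCBI E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids, db="pubmed", retmode="xml"):
    """Build an EFetch URL for a list of record IDs (PMIDs for db='pubmed')."""
    params = {
        "db": db,
        "id": ",".join(str(p) for p in pmids),  # IDs are passed comma-separated
        "retmode": retmode,
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Hypothetical PMIDs, used only to show the call.
print(efetch_url([12345678, 23456789]))
```

The resulting URL can be fetched with curl or wget like any other; NCBI asks heavy users to throttle their requests, so a pause between calls is sensible when looping over many IDs.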