How Do I Download All Full-Text Papers From A DSpace Repository?
14.3 years ago

As some of you may know, I work (among many other things) on text mining. An obvious prerequisite for text mining is to actually have the text, which, due to paywalls, is not as easy as one could wish. Supposedly, a good number of papers published behind paywalls are deposited by the authors in their respective institutional repositories, many of which run the DSpace software.

My question is thus: how do I download the entire collection of full-text papers that have been deposited in a DSpace repository?

I am very familiar with wget and curl, but I hope there is a better way to get all papers than to mirror the entire repository and subsequently sort out the mess of files and directories.

DSpace papers • 5.5k views
14.3 years ago

What about the dspace Python package, or ready-made Python executables?

dspace.repository.Repository can get handles or items, and should therefore provide an easy way to harvest the documents, I think.

Thanks for the suggestions - I will look into them. It looks to me as if the dspace Python package can harvest only metadata, not the actual documents, though.

Yes, that indeed looks promising. Once I have the handles it should be easy to download the papers ... I hope.
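
For what it is worth, here is a minimal sketch of the "handles to papers" step. It rests on assumptions not confirmed anywhere in this thread: that item pages live at `<base>/handle/<handle>`, and that the PDFs are linked from those pages through URLs containing `/bitstream/`. Many DSpace installations follow that layout, but a given repository may differ. The names `item_url`, `pdf_links` and `fetch` are purely illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class BitstreamLinkParser(HTMLParser):
    """Collect href values that look like DSpace bitstream links to PDFs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string (e.g. ?sequence=1) when checking the extension.
        path = href.split("?", 1)[0]
        # ASSUMPTION: full-text files are served from a /bitstream/ path.
        if "/bitstream/" in path and path.lower().endswith(".pdf"):
            self.links.append(href)


def item_url(base_url, handle):
    """Build the item page URL for a handle (ASSUMED layout: <base>/handle/<handle>)."""
    return f"{base_url.rstrip('/')}/handle/{handle}"


def pdf_links(base_url, item_html):
    """Return absolute, de-duplicated PDF links found in an item page."""
    parser = BitstreamLinkParser()
    parser.feed(item_html)
    found = []
    for href in parser.links:
        absolute = urljoin(base_url.rstrip("/") + "/", href)
        if absolute not in found:
            found.append(absolute)
    return found


def fetch(url):
    """Download a URL and return the raw bytes."""
    with urlopen(url) as response:
        return response.read()
```

With a list of handles in hand, the loop would be roughly: `page = fetch(item_url(base, handle)).decode("utf-8", "replace")`, then `fetch(link)` for each entry in `pdf_links(base, page)`, with a pause between requests to stay polite to the server.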

12.7 years ago
Ankur ▴ 20

You can use the EUtils SOAP web services from NCBI. EFetch would do the trick for you. Set the database parameter to PubMed.
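
A minimal sketch of such an EFetch request, with one swap stated plainly: it uses the URL-based E-utilities interface rather than SOAP. The helper name `efetch_url` and the example PubMed IDs are made up for illustration. The request returns the PubMed records (citation and abstract) for the given IDs.

```python
from urllib.parse import urlencode

# Base URL of the EFetch utility in the URL-based E-utilities interface.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids, retmode="xml"):
    """Build an EFetch URL for a list of PubMed IDs (hypothetical helper)."""
    params = {
        "db": "pubmed",  # the database parameter mentioned above
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": retmode,
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example IDs are placeholders, not references to specific papers.
    print(efetch_url([123, 456]))
```

The resulting URL can be passed to `urllib.request.urlopen` or to curl to retrieve the records.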
