As some of you may know, I work (among many other things) on text mining. A prerequisite for text mining is obviously to have the text in the first place, which, because of paywalls, is not as easy as one could wish. Supposedly, a good number of papers published behind paywalls are deposited by their authors in institutional repositories, many of which run the DSpace software.
My question is thus: how do I download the entire collection of full-text papers that have been deposited in a DSpace repository?
I am very familiar with wget and curl, but I hope there is a better way to get all papers than to mirror the entire repository and subsequently sort out the mess of files and directories.
dspace.repository.Repository can get handles or items, and should therefore provide an easy way to harvest the documents, I think.
Thanks for the suggestions - I will look into them. It looks to me as if the dspace Python package can harvest only metadata, not the actual documents, though.
Yes, that indeed looks promising. Once I have the handles it should be easy to download the papers ... I hope.
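For what it is worth, here is a minimal Python sketch of that last step, going from a handle to the full-text files. It rests on assumptions that are not confirmed anywhere in this thread: that the item page lives at <base URL>/handle/<handle>, and that the deposited files are linked from that page under a path containing "/bitstream/". That layout is common in older DSpace web interfaces but may differ on a given repository. The names item_url, bitstream_urls and BitstreamLinkParser are hypothetical helpers made up for illustration, not part of the dspace package.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BitstreamLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like DSpace bitstream links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: full-text files are linked under a "/bitstream/" path.
        if href and "/bitstream/" in href:
            self.hrefs.append(href)


def item_url(base_url, handle):
    """Build the item page URL for a handle (assumed layout: <base>/handle/<handle>)."""
    return f"{base_url.rstrip('/')}/handle/{handle}"


def bitstream_urls(page_html, page_url, extensions=(".pdf",)):
    """Return absolute, de-duplicated file URLs found in an item page.

    Only links whose path (ignoring any query string) ends with one of
    the given extensions are kept, so licence files and the like are skipped.
    """
    parser = BitstreamLinkParser()
    parser.feed(page_html)
    urls = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)
        path = absolute.split("?", 1)[0].lower()
        if path.endswith(extensions) and absolute not in urls:
            urls.append(absolute)
    return urls
```

To use it, fetch each item page (for example with urllib.request.urlopen(item_url(base, handle)).read().decode()), pass the HTML to bitstream_urls, and download the returned URLs. A short pause between requests would be polite to the repository.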