I am interested in doing some natural language processing on the corpus of MEDLINE/PubMed journal citations.
I have recently become aware of Apache Solr, which is a Lucene-based search server. I have no experience with Solr, but from what I can tell it seems to be a good way to go about tackling my task.
I'm wondering if anyone has experience with something like this and may have some insights to share. For example, how should the Solr schema be devised? What has to be done with the XML files so that they are in a format that Solr can index?
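To make the XML question concrete, here is a minimal sketch of the flattening step I imagine is needed: each citation becomes one flat record that could then be serialized and sent to Solr. The element names (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`, `DescriptorName`) follow the MEDLINE/PubMed XML, but the output field names (`pmid`, `title`, `journal`, `abstract`, `mesh`) are placeholders I made up, not an established schema, and the sample record is invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample record; real files contain many more elements per citation.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract>
          <AbstractText>First part.</AbstractText>
          <AbstractText>Second part.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def node_text(node):
    """Return all text under a node, including text inside inline markup."""
    return "".join(node.itertext()).strip()


def citation_to_doc(article):
    """Flatten one <PubmedArticle> element into a dict of hypothetical fields."""
    cit = article.find("MedlineCitation")

    def first(path):
        node = cit.find(path)
        return node_text(node) if node is not None else ""

    return {
        "pmid": first("PMID"),
        "title": first("Article/ArticleTitle"),
        "journal": first("Article/Journal/Title"),
        # Structured abstracts have several AbstractText parts; join them.
        "abstract": " ".join(
            node_text(n) for n in cit.findall("Article/Abstract/AbstractText")
        ),
        # Multi-valued field: one entry per MeSH descriptor.
        "mesh": [
            node_text(n)
            for n in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")
        ],
    }


def parse_citations(xml_text):
    """Parse a PubmedArticleSet string into a list of flat dicts."""
    root = ET.fromstring(xml_text)
    return [citation_to_doc(a) for a in root.iter("PubmedArticle")]


docs = parse_citations(SAMPLE)
print(json.dumps(docs, indent=2))
```

Whether a flat record like this is the right shape for a Solr schema (for example, which fields should be multi-valued or tokenized) is exactly the part I am unsure about.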
Aside from two posts here and here, which are old and don't offer any detailed advice, there is not much to be found on this topic. There is another post here about indexing the Gene Ontology with Solr, but I'm not quite sure how to translate the advice into what is necessary for MEDLINE/PubMed.
Thanks for the comment. I wasn't necessarily making any assumptions about the difficulty of indexing. I do appreciate (at least to some degree) the challenge of indexing the journal citations.
I have previously parsed the XML citations and loaded them into a relational database. However, I've recently come across a number of references to Lucene/Solr, and it appears that more people are using them for this type of task. I'm thinking that this may be a better approach than the one I was originally using. I have no background in this area, so I'm just trying to solicit some guidance or anecdotes from others who have done similar things (experience with MEDLINE/PubMed specifically is a bonus).
I will search the literature to see if I can find some useful information to get me started.
Thanks.