I am trying to load PubMed locally. I downloaded all of the PubMed XML files provided by NCBI to create a local copy of PubMed. I searched for prior work on this and found a good paper, "Tools for loading MEDLINE into a local relational database", along with several other sources that discuss it. Another source I would like to mention is http://biostar.stackexchange.com/questions/10049/how-do-people-go-about-pubmed-text-mining.
I have parsed the XML files into flat files. I decided to try loading some sample data into MySQL, running a few queries, and seeing how it works.
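For context, the parsing step boils down to something like the sketch below. This is just an illustration, not my exact script: I am using XML::Twig here, the element names follow the MEDLINE citation DTD, and the tab-separated output format is an arbitrary choice.

    #!/usr/bin/perl
    # Sketch: stream a MEDLINE/PubMed XML file and dump PMID, title and
    # abstract as tab-separated lines. Element names follow the MEDLINE DTD;
    # the output format is an arbitrary choice for this illustration.
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => { MedlineCitation => \&handle_citation },
    );
    $twig->parsefile($ARGV[0]);   # path to one of the downloaded XML files

    sub handle_citation {
        my ($twig, $cit) = @_;
        my $pmid     = $cit->first_child_text('PMID');
        my $article  = $cit->first_child('Article');
        my $title    = $article->first_child_text('ArticleTitle');
        my $abstract = '';
        if (my $abs = $article->first_child('Abstract')) {
            $abstract = join ' ', map { $_->text } $abs->children('AbstractText');
        }
        $abstract =~ s/\s+/ /g;               # keep one record per line
        print join("\t", $pmid, $title, $abstract), "\n";
        $twig->purge;                         # free memory as we go
    }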
Here is what I am looking for in the local copy of PubMed:
- I have a dictionary of terms that I want to search against PubMed to retrieve the matching abstracts.
To achieve this I am trying to load the data into MySQL and use full-text search (http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html) to query the database. I think full-text indexing is a good option, since I will be searching the text in the abstract column, and it should make the task easier.
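In case it helps to see it concretely, here is a minimal sketch of that setup from Perl with DBI. The database name, credentials, and the citations/pmid/title/abstract names are placeholders of mine, not anything prescribed by MEDLINE. Note that in MySQL 5.1 FULLTEXT indexes are only available on MyISAM tables.

    #!/usr/bin/perl
    # Sketch: create a MyISAM table for the parsed citations, add a FULLTEXT
    # index on the abstract column, and run a simple full-text query.
    # Database name, credentials and column names are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect(
        'DBI:mysql:database=pubmed;host=localhost',
        'user', 'password',
        { RaiseError => 1 },
    );

    # MySQL 5.1 supports FULLTEXT indexes on MyISAM tables only.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS citations (
            pmid     INT UNSIGNED NOT NULL PRIMARY KEY,
            title    TEXT,
            abstract TEXT
        ) ENGINE=MyISAM
    });

    # ... bulk-load the flat files here, e.g. with LOAD DATA INFILE ...

    # Building the index after the bulk load is usually much faster than
    # maintaining it while every row is inserted.
    $dbh->do(q{ALTER TABLE citations ADD FULLTEXT INDEX ft_abstract (abstract)});

    # A simple natural-language full-text query on the abstract column.
    my $sth = $dbh->prepare(q{
        SELECT pmid, title
        FROM   citations
        WHERE  MATCH(abstract) AGAINST (?)
    });
    $sth->execute('p53 apoptosis');
    while (my ($pmid, $title) = $sth->fetchrow_array) {
        print "$pmid\t$title\n";
    }
    $dbh->disconnect;

Regarding the indexing question below: for MyISAM it is generally much faster to bulk-load the table first and add the FULLTEXT index afterwards with ALTER TABLE, as in the sketch, rather than maintaining the index during every insert.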
Now my concerns are:
--Is the approach I am following the right one?
--I have read some reviews about full-text indexing and the computational time it requires, and I am worried about how it will perform on roughly 18 million PubMed records.
--Is there a better way, or am I on the right track?
--Can I also plug in another parser to help with natural language processing after querying the database for abstracts? (I know I can, but is it a good idea to include it now or later?)
--If full-text indexing is a good idea, when should I build the index: while I populate the database, or after?
--My major concern is how to query using the dictionary of terms. I have two separate dictionaries, and I want to use them together, perhaps combined with a Boolean operator. I tried using EUtils with Perl to get the abstracts, but since the dictionaries contain thousands of terms it takes a very long time. With Perl and EUtils I know how to query PubMed, but how do I query my local copy of the database once I have it? Can I do that from Perl as well? (See the sketch after this list.)
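On that last point: once the data is local, the whole thing can be driven from Perl with DBI, using MySQL's Boolean-mode full-text search to combine the two dictionaries. Below is a rough sketch under the same assumptions as above (the hypothetical citations table); dictionary_a.txt and dictionary_b.txt are placeholder file names with one term per line.

    #!/usr/bin/perl
    # Sketch: fetch abstracts that contain a term from dictionary A together
    # with at least one term from dictionary B, using Boolean-mode full-text
    # search. Table/column names and dictionary file names are placeholders.
    use strict;
    use warnings;
    use DBI;

    sub read_terms {
        my ($file) = @_;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        chomp(my @terms = <$fh>);
        close $fh;
        return grep { length } @terms;
    }

    my @dict_a = read_terms('dictionary_a.txt');   # e.g. gene names
    my @dict_b = read_terms('dictionary_b.txt');   # e.g. disease terms

    my $dbh = DBI->connect(
        'DBI:mysql:database=pubmed;host=localhost',
        'user', 'password',
        { RaiseError => 1 },
    );

    my $sth = $dbh->prepare(q{
        SELECT pmid, abstract
        FROM   citations
        WHERE  MATCH(abstract) AGAINST (? IN BOOLEAN MODE)
    });

    # In Boolean mode, +"term" requires the term and +( ... ) requires at
    # least one term from the group. For a dictionary with thousands of
    # terms you would split the group into batches rather than build one
    # huge query string.
    my $b_group = '(' . join(' ', map { qq{"$_"} } @dict_b) . ')';

    for my $term (@dict_a) {
        $sth->execute(qq{+"$term" +$b_group});
        while (my ($pmid, $abstract) = $sth->fetchrow_array) {
            print "$term\t$pmid\n";
        }
    }
    $dbh->disconnect;

Two caveats with MySQL full-text search worth keeping in mind: terms shorter than ft_min_word_len (4 by default) or in the built-in stopword list will not match, and terms containing double quotes would need escaping before being dropped into the query string.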
I hope I have stated my question clearly! Just let me know if I haven't and I will try to improve it. Any help is greatly appreciated. Thank you.
Why would you want to import the data into a relational DB? If you already have structure and hierarchy thanks to the XML, what is the advantage of flattening your data?
@Pablo Pareja: It's not easy to create a full-text index on specific parts of an XML file, and XML files cannot be queried as easily as data stored in a relational database.
Well, I was not talking about using the raw XML files just like that. Obviously you should parse them and, first of all, extract only the information you are interested in. Then I'd suggest storing the structured data in either a native graph-oriented DB like Neo4j or an XML-native DB like Berkeley DB XML. On top of that you could use a standard full-text indexer like Lucene (as @GWW mentions) for the subsets of data on which you want to do exhaustive text-based searches (Neo4j already includes it as part of the DB).
Hehe, I thought I was replying to smandape instead of you, @GWW, sorry for that ;)
Hello, can you tell me how you downloaded all of the PubMed XML files?