Local Copy Of Pubmed
5
4
Entering edit mode
13.2 years ago
Smandape ▴ 120

I am trying to load pubmed locally. I downloaded all of the pubmed XML files provided by NCBI to create a local copy of pubmed. I searched if anyone has done this before and I found a good paper "Tools for loading MEDLINE into a local relational database" and many other sources that talk about this. Another source I would like to mention is http://biostar.stackexchange.com/questions/10049/how-do-people-go-about-pubmed-text-mining.

I have parsed the XML files into flat files. I decided to try loading a sample data into mysql, try some queries and look how it works.

Here is what I am looking for in the local copy of pubmed:

-I have a dictionary of terms that I want to search in pubmed and get the abstracts.

In order to achieve my goal I am trying to load the data into mysql and use http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html to query database. I think full-text indexing is a good option as I will be looking through the text in abstract column. I think, this will make my task easy.

Now my concern is:

--Is it the right approach, I am following?

--I read some reviews about full-text indexing and the required computational time. I am afraid how it will work for 18 million pubmed records.

--Is there any better way than this or I am on the right track?

--Can I also include other parser that will help me for natural language processing post-querying the database for abstracts? (I know I can, but is it a good idea to include it now or later?)

--If full-text indexing is a good idea, when shall I do it? while I populate the database or after I populate the database?

--Major concern is how do I query using dictionary of terms? I have two separate dictionaries and I want to use them both with, maybe, Boolean operator. I tried using eutils with perl to get the abstracts, but as the list of terms in dictionary is in thousands it takes much time computationally to get the abstracts. Using perl and eutils I know how to query the database, but how can I do it once I get a local copy of the database? Can I do it using perl?

I hope, I have put my question in more clear sense! Just let me know if I haven't and I will try to improve on it. Any help is greatly appreciated. Thank you.

mysql data • 8.0k views
ADD COMMENT
0
Entering edit mode

Why would you want to import the data into a relational DB ? If you already have structure and hierarchy somehow thanks to the XML, what's the advantage of flattenning your data?

ADD REPLY
0
Entering edit mode

@Pablo Pareja: It's not easy to create a full text index on specific parts of an XML file and XML files cannot be queried easily compared to data stored in a relational database.

ADD REPLY
0
Entering edit mode

Well, I was not talking about using the raw XML files just like that. Obviously you should parse them and first of all extract only the information you are interested in. Then I'd suggest storing the structured data in either a native graph oriented db like Neo4j or a XML native db like berkleyXML. Besides you could use a standard full-text indexer like Lucene (as @GWW mentions) on top of that for the sub-sets of data you want to do exhaustive text based searches, (Neo4j already includes it as part of the DB)

ADD REPLY
0
Entering edit mode

hehehe thought I was replying to smandape instead of you @GWW, sorry for that ;)

ADD REPLY
0
Entering edit mode

Hello Can you tell me how did you download all of the pubmed XML files?

ADD REPLY
5
Entering edit mode
13.2 years ago
Gareth Palidwor ★ 1.6k

I recommend Lucene or a text indexer, SQL is not the appropriate tool for working with Medline in my experience.

Lucene is very different; you decide what to index, Lucene then can identify records on that basis. I don't recommend embedding the record in the lucene index (which is possible) as it seriously degrades performance. I've had good results indexing the byte position of PMIDs in the (uncompressed) files and retrieving them as required using a file seek.

Remember that with Lucene you don't have to put just the text in the index, you can also index and store derived properties at index time. For example, it would be very expensive to retrieve the XML records and count the number of words in an abstract for thousands of abstracts on a query, but you can store the number of words in an abstract in the record at index time. Re-indexing should only take a few hours so it's easy enough to incrementally add derived properties. Perhaps that's when you could do the NLP analysis.

Lucene indexes are crazy fast, here's a fun implementation of a Google-trends style search I published a while ago. It's querying for each search term over all MEDLINE records for each year individually; ~60 queries per search term, it generally can do the search in seconds if not milliseconds.

http://www.ogic.ca/mltrends/

Also: Lucene supports boolean search terms, fuzzy searches, wildcards etc. You have to pay close attention to your tokenizing and indexing options however or it may act as you expect.

edited to fix typos and add stuff about boolean searches

ADD COMMENT
1
Entering edit mode

@gawp thank you for your answer. Can you elaborate more on "Perhaps that's when you could do the NLP analysis."

ADD REPLY
0
Entering edit mode

Lets say you want do run some sort of NLP analysis of the abstracts like identify lists of genes referenced or word disambiguation or even sentiment analysis. At index time you read through each XML record for lucene indexing, but you can also run your per-record NLP analysis then, storing the result in the record so it is a query-able element. You could then retrieve records with specific genes referenced, or a specific usage of a word (disambiguated), or a positive sentiment expressed.

This may not make sense depending

ADD REPLY
0
Entering edit mode

Lets say you want do run some sort of NLP analysis of the abstracts like identify lists of genes referenced or word disambiguation or even sentiment analysis. At index time you read through each XML record for lucene indexing, but you can also run your per-record NLP analysis then, storing the result in the record so it is a query-able element. You could then retrieve records with specific genes referenced, or a specific usage of a word (disambiguated), or a positive sentiment expressed.

ADD REPLY
3
Entering edit mode
13.2 years ago
Gww ★ 2.7k

You may want to look into a full-text indexer optimized for this task such as Xapian or Apache Lucene. The quality of the search results and the speed of the queries will be better. Furthermore, you won't have to load everything from the XML files in the index for these tools; just the parts you want to search and the pubmed ID or PMC id.

ADD COMMENT
0
Entering edit mode

@GWW and @Pablo Pareja thank you for your help. So, in this case I can use Lucene in perl. I went through lucene and Neo4j. Just correct me if I am wrong, if I decide to go with lucene my approach can be to create a full-text index and search it using lucene. I can query the database using Plucene and then I don't need to use anything else, am I correct?

ADD REPLY
3
Entering edit mode
13.2 years ago
Yogesh Pandit ▴ 520

If you need a local copy I would recommend using MySQL (or any other) for storage. MEDLINE will have only the abstracts and not full-text documents. You can follow this link to get your MEDLINE XML files into a local storage server: LingPipe Database Textmining

Once your have this local copy you will not have to worry about NCBI usage policy. Then using the dictionary you can query the database to get all the relevant abstracts. Your query can be like SELECT abstract FROM <dbname> WHERE abstract LIKE "%term%";. This will also help you filter out the non relevant documents. The retrieved abstracts will not be more than 20 sentences each. You can iterate this process using DBI modules in Perl.

If this data is going to be searched again and again, one way can be you can store the dictionary with a references to the PMIDs and use that to retrieve abstracts later.

If you want to index all the documents, including the non-relevant one's this presentation has pointers on using Lucene with MySQL

ADD COMMENT
3
Entering edit mode
13.2 years ago
Bio_Neo ▴ 30

One alternative strategy could be using a NoSQL database like CouchDB or MongoDB to store the PubMed data. You can index on the MeSH terms, title and abstract using the built in indexing system. For more details you can look at this link for a nice tutorial on storing PubMed data with MongoDB.

ADD COMMENT
2
Entering edit mode
13.2 years ago
Rama ▴ 40

Hi,

Textpresso does parts of what you want to do. Check them out.

r

ADD COMMENT

Login before adding your answer.

Traffic: 1792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6