Importing PubMed MEDLINE Details Into a Local RDBMS to Execute Data Mining Methods
11.1 years ago

Hi everyone,

I want to run data mining methods on a PubMed dataset (MEDLINE in XML). For this I found a 2004 paper, "Software to parse and load MEDLINE into a RDBMS", and tried to run its Java code (http://biotext.berkeley.edu/software.html). I can't get the MedinlineParser to work; it is probably a problem with JAXP or other outdated libraries. I also can't find any recent solution for mining a PubMed dataset (XML files) directly, or for first loading it into a local RDBMS.

Are there any working solutions? Maybe an XSLT stylesheet?

I would be very grateful if you could help me to find a solution.

Best regards, Mark

xml database pubmed • 3.7k views

Note: Mark asked me this question by email, and I suggested that he post it on biostars.org to get answers from the community.

11.1 years ago

When http://nodalpoint.org/ was still alive (... ;-) ) I suggested using an XSLT stylesheet to import PubMed XML into a database. I quickly wrote an XSLT stylesheet that inserts some PubMed articles into a sqlite3 database. See https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/pubmed2sqlite.xsl . Here I only use 3 tables, but the schema could be far more complicated. The stylesheet prints SQL statements to standard output:

$ xsltproc --novalid  stylesheets/bio/ncbi/pubmed2sqlite.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=9891771,21378989&retmode=xml"

create table if not exists Journal
    (
    nlmUniqueID  TEXT UNIQUE NOT NULL,
    medlineTA TEXT
    );


create table if not exists PubmedArticle
    (
    pmid INT UNIQUE NOT NULL,
    title TEXT,
    abstract TEXT,
    nlmUniqueID TEXT,
    FOREIGN KEY(nlmUniqueID ) REFERENCES Journal(nlmUniqueID)
    );

create table if not exists Author
    (
    lastName TEXT,
    foreName TEXT,
    pmid INT NOT NULL,
    position INT,
    FOREIGN KEY(pmid ) REFERENCES PubmedArticle(pmid)
    );

create unique index if not exists Author2Article on Author(lastName,foreName,pmid);
begin transaction;
insert or ignore into Journal(nlmUniqueID,medlineTA) values ('7609767','Ann Chir Gynaecol');
insert or ignore into PubmedArticle(pmid,title,abstract,nlmUniqueID) values ('9891771','Prognosis and surveillance of gastrointestinal stromal/smooth muscle tumors.','','7609767');
insert or ignore into Author(lastName,foreName,pmid,position) values ('Emory','T S','9891771',1);
insert or ignore into Author(lastName,foreName,pmid,position) values ('O''Leary','T J','9891771',2);
insert or ignore into Journal(nlmUniqueID,medlineTA) values ('9216904','Nat Genet');
insert or ignore into PubmedArticle(pmid,title,abstract,nlmUniqueID) values ('21378989','Truncating mutations in the last exon of NOTCH2 cause a rare skeletal disorder with osteoporosis.','Hajdu-Cheney syndrome is a rare autosomal dominant skeletal disorder with facial anomalies, osteoporosis and acro-osteolysis. We sequenced the exomes of six unrelated individuals with this syndrome and identified heterozygous nonsense and frameshift mutations in NOTCH2 in five of them. All mutations cluster to the last coding exon of the gene, suggesting that the mutant mRNA products escape nonsense-mediated decay and that the resulting truncated NOTCH2 proteins act in a gain-of-function manner.','9216904');
insert or ignore into Author(lastName,foreName,pmid,position) values ('Isidor','Bertrand','21378989',1);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Lindenbaum','Pierre','21378989',2);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Pichon','Olivier','21378989',3);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Bézieau','Stéphane','21378989',4);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Dina','Christian','21378989',5);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Jacquemont','Sébastien','21378989',6);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Martin-Coignard','Dominique','21378989',7);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Thauvin-Robinet','Christel','21378989',8);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Le Merrer','Martine','21378989',9);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Mandel','Jean-Louis','21378989',10);
insert or ignore into Author(lastName,foreName,pmid,position) values ('David','Albert','21378989',11);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Faivre','Laurence','21378989',12);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Cormier-Daire','Valérie','21378989',13);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Redon','Richard','21378989',14);
insert or ignore into Author(lastName,foreName,pmid,position) values ('Le Caignec','Cédric','21378989',15);

commit transaction;

Then pipe the generated SQL into sqlite3 and query the result:

$ xsltproc --novalid  stylesheets/bio/ncbi/pubmed2sqlite.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=9891771,21379889&retmode=xml"| sqlite3 test.db

$ sqlite3 test.db 'select * from Journal'
7609767|Ann Chir Gynaecol
101484507|Prov Med Surg J (1840)
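
For readers who would rather script the import than use XSLT, the same three-table load can be sketched in Python with only the standard library. This is a minimal sketch and not a port of the stylesheet: the function name `load_pubmed_xml` and the `SAMPLE` fragment are made up for illustration, the fragment is a hand-trimmed record shaped like efetch output (using the article from the example above), and the element paths assume the usual PubMed XML layout (`MedlineCitation`, `Article`, `MedlineJournalInfo`).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Same three tables as the SQL generated by the stylesheet above.
SCHEMA = """
create table if not exists Journal
    (nlmUniqueID TEXT UNIQUE NOT NULL, medlineTA TEXT);
create table if not exists PubmedArticle
    (pmid INT UNIQUE NOT NULL, title TEXT, abstract TEXT, nlmUniqueID TEXT,
     FOREIGN KEY(nlmUniqueID) REFERENCES Journal(nlmUniqueID));
create table if not exists Author
    (lastName TEXT, foreName TEXT, pmid INT NOT NULL, position INT,
     FOREIGN KEY(pmid) REFERENCES PubmedArticle(pmid));
create unique index if not exists Author2Article
    on Author(lastName, foreName, pmid);
"""

# Hand-trimmed fragment for illustration only; real efetch records
# contain many more elements.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>9891771</PMID>
      <Article>
        <ArticleTitle>Prognosis and surveillance of gastrointestinal stromal/smooth muscle tumors.</ArticleTitle>
        <AuthorList>
          <Author><LastName>Emory</LastName><ForeName>T S</ForeName></Author>
          <Author><LastName>O'Leary</LastName><ForeName>T J</ForeName></Author>
        </AuthorList>
      </Article>
      <MedlineJournalInfo>
        <MedlineTA>Ann Chir Gynaecol</MedlineTA>
        <NlmUniqueID>7609767</NlmUniqueID>
      </MedlineJournalInfo>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def _text(element):
    """Full text of an element, including text inside inline markup."""
    return "" if element is None else "".join(element.itertext())


def load_pubmed_xml(xml_text, conn):
    """Insert every PubmedArticle found in xml_text into the sqlite database."""
    conn.executescript(SCHEMA)
    root = ET.fromstring(xml_text)
    with conn:  # one transaction for the whole batch
        for article in root.iter("PubmedArticle"):
            cit = article.find("MedlineCitation")
            pmid = int(cit.findtext("PMID"))
            nlm_id = cit.findtext("MedlineJournalInfo/NlmUniqueID")
            journal = cit.findtext("MedlineJournalInfo/MedlineTA")
            title = _text(cit.find("Article/ArticleTitle"))
            abstract = " ".join(
                _text(part)
                for part in cit.iterfind("Article/Abstract/AbstractText"))
            conn.execute(
                "insert or ignore into Journal(nlmUniqueID,medlineTA) values (?,?)",
                (nlm_id, journal))
            conn.execute(
                "insert or ignore into PubmedArticle(pmid,title,abstract,nlmUniqueID)"
                " values (?,?,?,?)",
                (pmid, title, abstract, nlm_id))
            authors = cit.iterfind("Article/AuthorList/Author")
            for position, author in enumerate(authors, start=1):
                conn.execute(
                    "insert or ignore into Author(lastName,foreName,pmid,position)"
                    " values (?,?,?,?)",
                    (author.findtext("LastName"), author.findtext("ForeName"),
                     pmid, position))
```

Calling `load_pubmed_xml(SAMPLE, sqlite3.connect("test.db"))` should fill the same tables as the piped `xsltproc` command. Bound parameters (`?`) take care of quoting names such as O'Leary, which the stylesheet has to escape by hand, and `insert or ignore` keeps a second run over the same records from duplicating rows.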

Thank you for your answer! After getting familiar with XSLT I will try to use it for a more complex schema.


Just FYI, nodalpoint is archived :) EDIT: I see Pierre found the archive too.

11.0 years ago

You can also give BioGyan (http://www.biogyan.com/) a try. It is a comprehensive search tool designed for biologists that enables search, annotation and ranking of scientific literature from public databases. You can then export your results to Excel and import them into the RDBMS of your choice.
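
The import step could look roughly like the sketch below. It assumes the spreadsheet has been saved as CSV with a header row; the actual columns of a BioGyan export are not known here, so the function name `import_csv`, the table name `ExportedHits` and the `pmid`/`title` columns in the usage note are placeholders rather than anything the tool defines.

```python
import csv
import sqlite3


def import_csv(csv_file, conn, table="ExportedHits"):
    """Load a CSV file object with a header row into a sqlite table.

    Every column is stored as TEXT; the table name is a placeholder.
    """
    reader = csv.reader(csv_file)
    header = next(reader)

    def quote(name):
        # Quote identifiers so that headers with spaces still work.
        return '"{}"'.format(name.replace('"', '""'))

    columns = ", ".join("{} TEXT".format(quote(name)) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(
        "create table if not exists {} ({})".format(quote(table), columns))
    with conn:  # commit all rows in one transaction
        conn.executemany(
            "insert into {} values ({})".format(quote(table), placeholders),
            reader)
```

With a file whose header is `pmid,title`, `import_csv(open("export.csv", newline=""), sqlite3.connect("test.db"))` would create a two-column table that can be joined against the other tables on `pmid`.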
