The following Java program parses PubMed XML from stdin and prints the number of days between the "received" and "accepted" dates:
import java.io.InputStream;
import java.util.GregorianCalendar;
import java.util.concurrent.TimeUnit;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;
public class Biostar54473
{
private static class PubMedPubDate
{
int year;
int month=-1;
int day=-1;
@Override
public String toString() {
String s=String.format("%04d", year);
if(month!=-1)
{
s+="-"+String.format("%02d", month);
if(day!=-1)
{
s+="-"+String.format("%02d", day);
}
}
return s;
}
long getTimeInMillis()
{
GregorianCalendar cal=new GregorianCalendar(
year,
month==-1?0:month-1,
month==-1 || day==-1?
1:day);
return cal.getTimeInMillis();
}
}
private void parse(InputStream in) throws Exception
{
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.FALSE);
factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
XMLEventReader r= factory.createXMLEventReader(in);
String PubStatus=null;
PubMedPubDate curr=null;
PubMedPubDate accepted=null;
PubMedPubDate received=null;
String MedlineTA=null;
String pmid=null;
String ArticleTitle=null;
QName attPubStatus=new QName("PubStatus");
while(r.hasNext())
{
XMLEvent evt=r.nextEvent();
if(evt.isStartElement())
{
String name=evt.asStartElement().getName().getLocalPart();
if(name.equals("PubmedArticle"))
{
pmid=null;
accepted=null;
received=null;
MedlineTA=null;
ArticleTitle=null;
}
else if(name.equals("ArticleTitle") && ArticleTitle==null)
{
ArticleTitle=r.getElementText().trim();
}
else if(name.equals("PMID") && pmid==null)
{
pmid=r.getElementText().trim();
}
else if(name.equals("MedlineTA") && MedlineTA==null)
{
MedlineTA=r.getElementText().trim();
}
else if(name.equals("PubMedPubDate"))
{
curr=null;
Attribute att=evt.asStartElement().getAttributeByName(attPubStatus);
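// reset the status when the attribute is missing, so a value from a previous element is not reused
PubStatus=(att!=null?att.getValue():null);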
if("received".equals(PubStatus))
{
curr=new PubMedPubDate();
received=curr;
}
else if("accepted".equals(PubStatus))
{
curr=new PubMedPubDate();
accepted=curr;
}
else
{
curr=null;
}
}
else if(curr!=null && name.equals("Year"))
{
// curr is either 'received' or 'accepted': on a malformed value, discard that date so the article is skipped
try { curr.year=Integer.parseInt(r.getElementText().trim()); } catch(Exception err) { if(curr==received) received=null; else accepted=null; curr=null;}
}
else if(curr!=null && name.equals("Month"))
{
// the month may be a number, a three-letter abbreviation or a full English name
String month=r.getElementText().trim().toLowerCase();
if(month.equals("jan") || month.equals("january")) month="1";
else if(month.equals("feb") || month.equals("february")) month="2";
else if(month.equals("mar") || month.equals("march")) month="3";
else if(month.equals("apr") || month.equals("april")) month="4";
else if(month.equals("may")) month="5";
else if(month.equals("jun") || month.equals("june")) month="6";
else if(month.equals("jul") || month.equals("july")) month="7";
else if(month.equals("aug") || month.equals("august")) month="8";
else if(month.equals("sep") || month.equals("september")) month="9";
else if(month.equals("oct") || month.equals("october")) month="10";
else if(month.equals("nov") || month.equals("november")) month="11";
else if(month.equals("dec") || month.equals("december")) month="12";
try { curr.month=Integer.parseInt(month); } catch(Exception err) { if(curr==received) received=null; else accepted=null; curr=null;}
}
else if(curr!=null && name.equals("Day"))
{
try { curr.day=Integer.parseInt(r.getElementText().trim()); } catch(Exception err) { if(curr==received) received=null; else accepted=null; curr=null;}
}
}
else if(evt.isEndElement())
{
String name=evt.asEndElement().getName().getLocalPart();
if(name.equals("PubmedArticle"))
{
if(received!=null && accepted!=null)
{
long n=accepted.getTimeInMillis()-received.getTimeInMillis();
System.out.println(
pmid+"\t"+
ArticleTitle+"\t"+
MedlineTA+"\t"+
received+"\t"+
accepted+"\t"+
TimeUnit.DAYS.convert(n, TimeUnit.MILLISECONDS)
);
}
ArticleTitle=null;
MedlineTA=null;
pmid=null;
curr=null;
received=null;
accepted=null;
}
else if(name.equals("PubMedPubDate"))
{
curr=null;
}
}
}
}
public static void main(String[] args) throws Exception
{
System.out.println("#pmid\t"+
"ArticleTitle\t"+
"MedlineTA\t"+
"Received\t"+
"Accepted\t"+
"DiffDays"
);
new Biostar54473().parse(System.in);
}
}
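One caveat about the day count: `getTimeInMillis()` builds each date at midnight in the JVM's default time zone, and `TimeUnit.DAYS.convert` truncates to whole 24-hour days, so an interval that crosses a switch to daylight-saving time can come out one day short. For instance, 2012-01-30 to 2012-09-20 spans 234 calendar days. A minimal sketch of a calendar-based count, assuming Java 8 or later (`java.time`); the class `DayDiff` and the helper `daysBetween` are hypothetical names and not part of the program above:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DayDiff
{
    /**
     * Calendar days from the first date to the second one.
     * Months are 1-based; a month or day of -1 falls back to 1,
     * mirroring the defaults used in getTimeInMillis().
     */
    static long daysBetween(int y1, int m1, int d1, int y2, int m2, int d2)
    {
        LocalDate received = LocalDate.of(y1, m1 == -1 ? 1 : m1, (m1 == -1 || d1 == -1) ? 1 : d1);
        LocalDate accepted = LocalDate.of(y2, m2 == -1 ? 1 : m2, (m2 == -1 || d2 == -1) ? 1 : d2);
        // LocalDate has no time zone, so daylight-saving changes cannot shorten the result
        return ChronoUnit.DAYS.between(received, accepted);
    }

    public static void main(String[] args)
    {
        System.out.println(daysBetween(2012, 7, 4, 2012, 9, 4));
        System.out.println(daysBetween(2012, 1, 30, 2012, 9, 20));
    }
}
```

Because `LocalDate` counts calendar days rather than elapsed milliseconds, the result does not depend on the time zone of the machine running the program.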
A 'verticalized' example for a few papers whose title contains the phrase "Next generation Sequencing". You can load this output into R or any other tool to get some statistics about a journal, a subject, etc.
$ javac Biostar54473.java && cat pubmed_result.xml | java Biostar54473
>>> 2
$1 #pmid 23020966
$2 ArticleTitle Transcriptome analysis using next-generation sequencing.
$3 MedlineTA Curr Opin Biotechnol
$4 Received 2012-07-04
$5 Accepted 2012-09-04
$6 DiffDays 62
<<< 2
>>> 3
$1 #pmid 23000871
$2 ArticleTitle Understanding pathogens in the era of next generation sequencing.
$3 MedlineTA J Infect Dev Ctries
$4 Received 2012-09-13
$5 Accepted 2012-09-14
$6 DiffDays 1
<<< 3
>>> 4
$1 #pmid 22994565
$2 ArticleTitle Accurate variant detection across non-amplified and whole genome amplified DNA using targeted next generation sequencing.
$3 MedlineTA BMC Genomics
$4 Received 2012-01-30
$5 Accepted 2012-09-20
$6 DiffDays 233
<<< 4
(...)
>>> 253
$1 #pmid 18604217
$2 ArticleTitle Alta-Cyclic: a self-optimizing base caller for next-generation sequencing.
$3 MedlineTA Nat Methods
$4 Received 2008-03-10
$5 Accepted 2008-06-02
$6 DiffDays 83
<<< 253
>>> 254
$1 #pmid 18262675
$2 ArticleTitle The impact of next-generation sequencing technology on genetics.
$3 MedlineTA Trends Genet
$4 Received 2007-11-15
$5 Accepted 2007-12-17
$6 DiffDays 32
<<< 254
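The per-journal statistics can also be computed without leaving Java by grouping the tab-separated output on the `MedlineTA` column. A minimal sketch, assuming the six-column layout printed by the program (a header line starting with `#`, no tab inside the titles); the class `JournalStats` and the method `medianByJournal` are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JournalStats
{
    /** Groups DiffDays (column 6) by MedlineTA (column 3) and returns the median per journal. */
    static Map<String, Double> medianByJournal(List<String> lines)
    {
        Map<String, List<Long>> byJournal = new TreeMap<>();
        for (String line : lines)
        {
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blank lines and the header
            String[] tokens = line.split("\t");
            if (tokens.length < 6) continue; // ignore rows that do not have the six expected columns
            byJournal.computeIfAbsent(tokens[2], k -> new ArrayList<>())
                     .add(Long.parseLong(tokens[5].trim()));
        }
        Map<String, Double> medians = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : byJournal.entrySet())
        {
            List<Long> days = e.getValue();
            Collections.sort(days);
            int n = days.size();
            // middle value for an odd count, mean of the two middle values otherwise
            double median = (n % 2 == 1)
                ? days.get(n / 2)
                : (days.get(n / 2 - 1) + days.get(n / 2)) / 2.0;
            medians.put(e.getKey(), median);
        }
        return medians;
    }

    public static void main(String[] args) throws Exception
    {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in)))
        {
            String line;
            while ((line = in.readLine()) != null) lines.add(line);
        }
        for (Map.Entry<String, Double> e : medianByJournal(lines).entrySet())
        {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

It would be fed the raw tab-separated output, for example `java Biostar54473 < pubmed_result.xml | java JournalStats`. The median is used here rather than the mean because a few very slow reviews would otherwise dominate a journal's figure.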
I would title this question as "Degree of burden in submitting a paper" :) !
It would be interesting to calculate the results per journal and compare them with the turnaround time the publisher claims :)
That's a good point. There are a lot of claims about the speed of the review process made by journals but as far as I know there is no one who checks these facts. Our experience with some journals has certainly deviated a great deal from their claims.
I've played with my Java program and uploaded the results to figshare: http://dx.doi.org/10.6084/m9.figshare.96403
Wish I had this when I was trying to calculate the embargo-induced delays in publication of the ENCODE papers http://caseybergman.wordpress.com/2012/09/05/the-cost-to-science-of-the-encode-publication-embargo/
Very useful idea!
This is an issue in the wet-lab world for sure: http://www.nature.com/news/2011/110427/full/472391a.html
I wonder if there is a similar phenomenon among bioinformatics journals. "Please provide tests of extra use cases..." that sort of thing. Anyone had that experience?