Hi all, I'm hoping someone might advise on the XML text output from rentrez following the workflow below. It seems that the output does not open/close all tags properly, and I'm not sure how I can clean this up. For example, in the output below the opening tag "Platform" isn't annotated with "&gt;", and therefore I can't gsub "&gt;" to make it workable/readable.
library(rentrez)
#retrieve data from SRR
r_search <- entrez_search(db="sra", term="SRR10025068")
r_search.id <- r_search$ids
all_the_links <- entrez_link(dbfrom='sra', id=r_search.id, db='all')
r_summ <- entrez_summary(db="sra", id=all_the_links$links$sra_bioproject_all)
xml.data.dirty <- r_summ$expxml
xml.data.dirty
[1] " <Summary><Title>Mouse 57</Title><Platform instrument_model=\"454 GS FLX Titanium\">LS454</Platform><Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/></Summary><Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/><Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/><Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/><Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/><Sample acc=\"SRS514105\" name=\"\"/><Instrument LS454=\"454 GS FLX Titanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample> "
#get usable XML file
xml.data.5knwn <- gsub("&gt;", ">", xml.data.dirty)
xml.data.5knwn <- gsub("&lt;", "<", xml.data.5knwn)
xml.data.5knwn <- gsub("&amp;", "&", xml.data.5knwn)
xml.data.5knwn <- gsub("&apos;", "'", xml.data.5knwn)
xml.data.5knwn <- gsub("&quot;", '"', xml.data.5knwn)
xml.data.5knwn.clean <- gsub(" ", "", xml.data.5knwn)
xml.data.5knwn.clean
[1] "<Summary><Title>Mouse57</Title><Platforminstrument_model=\"454GSFLXTitanium\">LS454</Platform><Statisticstotal_runs=\"1\"total_spots=\"6058\"total_bases=\"2449911\"total_size=\"1638287\"load_done=\"true\"cluster_name=\"public\"/></Summary><Submitteracc=\"SRA115778\"center_name=\"TexasA&MUniversity\"contact_name=\"SeanMcCaffrey\"lab_name=\"GastrointestinalLaboratory\"/><Experimentacc=\"SRX390677\"ver=\"1\"status=\"public\"name=\"Mouse57\"/><Studyacc=\"SRP033709\"name=\"MicegutbacteriaTargetedLocus(Loci)\"/><Organismtaxid=\"10090\"ScientificName=\"Musmusculus\"/><Sampleacc=\"SRS514105\"name=\"\"/><InstrumentLS454=\"454GSFLXTitanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample>"
Edit: typo
I am not sure what exactly you need to parse from this dataset but this looks clean enough.
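If the goal is to pull individual fields out of the string rather than just read it, one option is to decode the entities, wrap the fragment in a single root element (the string holds several top-level elements, so it is not a complete XML document on its own), and hand it to an XML parser. The sketch below is only an illustration: it assumes the xml2 package is installed, uses a shortened stand-in for r_summ$expxml that is assumed to arrive with its angle brackets escaped, and the helper name decode_entities and the wrapper element "root" are made up for this example. Note that it trims only leading and trailing whitespace, since removing every space also fuses tag names with their attributes.

```r
library(xml2)  # assumption: xml2 is installed

# Shortened stand-in for r_summ$expxml (assumed to be entity-escaped).
raw <- paste0(
  " &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;",
  "&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;",
  "&lt;/Summary&gt;",
  "&lt;Submitter center_name=\"Texas A&amp;amp;M University\"/&gt;",
  "&lt;Bioproject&gt;PRJNA231086&lt;/Bioproject&gt; "
)

# Hypothetical helper: undo one level of entity escaping.
# "&amp;" is handled last so that "&amp;lt;" is not decoded twice.
decode_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x <- gsub("&apos;", "'", x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)
}

# Trim only the outer whitespace, then wrap in one root element so the
# fragment becomes a well-formed document.
fragment <- trimws(decode_entities(raw))
doc <- read_xml(paste0("<root>", fragment, "</root>"))

# Pull out a few fields by XPath.
bioproject <- xml_text(xml_find_first(doc, "//Bioproject"))
instrument <- xml_attr(xml_find_first(doc, "//Platform"), "instrument_model")
center     <- xml_attr(xml_find_first(doc, "//Submitter"), "center_name")
```

With the real object, raw would be replaced by r_summ$expxml; the remaining "&amp;" in "Texas A&amp;M" is resolved by the parser itself.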
Thanks. I tried the Entrez E-utilities as well; the data returned by the two functions is similar but not identical, and unfortunately the parts I'm interested in are the ones that differ.
Wow, so I just realized that the issue isn't only the XML: the data returned is also incorrect, which is a much bigger issue.
Could you please explain what data are incorrect?
I think this bit in the original post does not match the information I obtained when using runinfo
Is that not expected? Your command is fetching the runinfo table, whereas @joe was downloading the Bioproject docsum. The corresponding edirect command for what @joe was doing:
The XML of the command shown above is still not in the best XML format, but it can be cleaned up by piping the output to xtract -format.
If I understand this correctly, the issue @joe has is related to the encoding of HTML characters in the r_summ$expxml object, not the data itself.
Ah I see. I only looked at the accession OP was using and looked up the runinfo. That is a NovaSeq 6000 run.
If we look at the bioproject that SRR10025068 belongs to (as far as I can see from this SRA page), where is the reference to 454 in OP's output coming from?
Good eyes! It was my (and the OP's) mistake. You see, we both used -target sra for our target db in the elink. So the data that was being fetched was for the identifier 561398 from SRA instead of BioProject. I have now fixed my command to use -target bioproject to get the correct data out.
My (original) issue was that the XML output was not correctly formatted, and I later realized the data returned was not correct.