XML from entrez not formatted correctly
5.1 years ago
noodle ▴ 590

Hi all, I'm hoping someone can advise on the XML text output from rentrez produced by the workflow below. It seems that the output does not open/close all tags properly, and I'm not sure how to clean this up. For example, in the output below the opening "Platform" tag doesn't appear to be closed with >, so I can't gsub > to make it workable/readable.

#retrieve data from SRR
r_search <- entrez_search(db="sra", term="SRR10025068")
r_search.id <- r_search$ids
all_the_links <- entrez_link(dbfrom='sra', id=r_search.id, db='all')
r_summ <- entrez_summary(db="sra", id=all_the_links$links$sra_bioproject_all)
xml.data.dirty <- r_summ$expxml
xml.data.dirty
[1] "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;&lt;Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/&gt;&lt;/Summary&gt;&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/&gt;&lt;Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/&gt;&lt;Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/&gt;&lt;Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/&gt;&lt;Sample acc=\"SRS514105\" name=\"\"/&gt;&lt;Instrument LS454=\"454 GS FLX Titanium\"/&gt;&lt;Library_descriptor&gt;&lt;LIBRARY_NAME/&gt;&lt;LIBRARY_STRATEGY&gt;AMPLICON&lt;/LIBRARY_STRATEGY&gt;&lt;LIBRARY_SOURCE&gt;GENOMIC&lt;/LIBRARY_SOURCE&gt;&lt;LIBRARY_SELECTION&gt;unspecified&lt;/LIBRARY_SELECTION&gt;&lt;LIBRARY_LAYOUT&gt;                 &lt;SINGLE/&gt;               &lt;/LIBRARY_LAYOUT&gt;&lt;/Library_descriptor&gt;&lt;Bioproject&gt;PRJNA231086&lt;/Bioproject&gt;&lt;Biosample&gt;SAMN02440270&lt;/Biosample&gt;  "

#get usable XML file
xml.data.5knwn <- gsub("&gt;", ">", xml.data.dirty)
xml.data.5knwn <- gsub("&lt;", "<", xml.data.5knwn)
xml.data.5knwn <- gsub("&amp;", "&", xml.data.5knwn)
xml.data.5knwn <- gsub("&apos;", "'", xml.data.5knwn)
xml.data.5knwn <- gsub("&quot;", '"', xml.data.5knwn)
xml.data.5knwn.clean <- gsub(" ", "", xml.data.5knwn)
xml.data.5knwn.clean
[1] "<Summary><Title>Mouse57</Title><Platforminstrument_model=\"454GSFLXTitanium\">LS454</Platform><Statisticstotal_runs=\"1\"total_spots=\"6058\"total_bases=\"2449911\"total_size=\"1638287\"load_done=\"true\"cluster_name=\"public\"/></Summary><Submitteracc=\"SRA115778\"center_name=\"TexasA&amp;MUniversity\"contact_name=\"SeanMcCaffrey\"lab_name=\"GastrointestinalLaboratory\"/><Experimentacc=\"SRX390677\"ver=\"1\"status=\"public\"name=\"Mouse57\"/><Studyacc=\"SRP033709\"name=\"MicegutbacteriaTargetedLocus(Loci)\"/><Organismtaxid=\"10090\"ScientificName=\"Musmusculus\"/><Sampleacc=\"SRS514105\"name=\"\"/><InstrumentLS454=\"454GSFLXTitanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample>"

Edit: typo

rentrez rXML entrez R XML • 2.0k views

I am not sure what exactly you need to parse from this dataset but this looks clean enough.

$ efetch -db sra -id "SRR10025068" -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2

Thanks, I tried the Entrez E-utilities as well. The data returned by the two approaches overlaps but is not identical, and unfortunately I need the fields that differ.


Wow, I just realized that the issue isn't just the XML; the data returned is incorrect too, which is a much bigger issue.


"the data returned is incorrect"

Could you please explain what data are incorrect?


I think this bit in the original post does not match information I obtained when using runinfo

<Summary><Title>Mouse 57</Title><Platform instrument_model=\"454 GS FLX Titanium\">LS454</Platform><Statistics total_runs=\"1\" total_spots=\"6058\" total_bases=\"2449911\" total_size=\"1638287\" load_done=\"true\" cluster_name=\"public\"/></Summary><Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;M University\" contact_name=\"Sean McCaffrey\" lab_name=\"Gastrointestinal Laboratory\"/><Experiment acc=\"SRX390677\" ver=\"1\" status=\"public\" name=\"Mouse 57\"/><Study acc=\"SRP033709\" name=\"Mice gut bacteria Targeted Locus (Loci)\"/><Organism taxid=\"10090\" ScientificName=\"Mus musculus\"/><Sample acc=\"SRS514105\" name=\"\"/><Instrument LS454=\"454 GS FLX Titanium\"/><Library_descriptor><LIBRARY_NAME/><LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>unspecified</LIBRARY_SELECTION><LIBRARY_LAYOUT> <SINGLE/>
</LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA231086</Bioproject><Biosample>SAMN02440270</Biosample> "


"I think this bit in the original post does not match information I obtained when using runinfo"

Is that not expected? Your command is fetching the runinfo table whereas @joe was downloading the Bioproject docsum. The corresponding edirect command for what @joe was doing:

esearch -db sra -query 'SRR10025068' \
  | elink -db sra -target bioproject -name sra_bioproject_all \
  | esummary

The XML returned by the command above is still not nicely formatted, but it can be cleaned up by piping the output to xtract -format.

If I understand this correctly, the issue @joe has is related to encoding of html characters in the r_summ$expxml object, not the data itself.
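
For the encoding problem itself, here is a minimal base-R sketch of one way to undo the escaping in r_summ$expxml. The function names and the "root" wrapper element are my own (hypothetical, not part of rentrez), and the xml2 step at the end is optional and only runs if that package is installed. The two things that seem to matter are decoding "&amp;" last and not stripping spaces.

```r
# Undo ONE level of XML escaping. "&amp;" is handled last on purpose:
# decoding it first would turn "&amp;lt;" into "&lt;" and then into "<".
decode_xml_entities <- function(x) {
  x <- gsub("&lt;",   "<",  x, fixed = TRUE)
  x <- gsub("&gt;",   ">",  x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x <- gsub("&apos;", "'",  x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)
}

# expxml holds several sibling elements (Summary, Submitter, ...), so wrap
# them in a single root element to get a well-formed document.
# "root" is an arbitrary, hypothetical element name.
as_xml_document_string <- function(escaped, root = "root") {
  paste0("<", root, ">", trimws(decode_xml_entities(escaped)), "</", root, ">")
}

# Shortened excerpt of the string shown in the original post
escaped <- "  &lt;Summary&gt;&lt;Title&gt;Mouse 57&lt;/Title&gt;&lt;Platform instrument_model=\"454 GS FLX Titanium\"&gt;LS454&lt;/Platform&gt;&lt;/Summary&gt;&lt;Submitter acc=\"SRA115778\" center_name=\"Texas A&amp;amp;M University\"/&gt;  "
clean <- as_xml_document_string(escaped)

# Optional: parse the result with xml2, if it is installed
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(clean)
  print(xml2::xml_attr(xml2::xml_find_first(doc, ".//Platform"), "instrument_model"))
}
```

Note that the spaces are left alone here. In the original post it is gsub(" ", "", ...) that fuses tag names with their attributes (giving "Platforminstrument_model"), so the tags only look unclosed after that step.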


Ah I see. I only looked at the accession OP was using and looked up the runinfo. That is a NovaSeq 6000 run.

If we look at the BioProject that SRR10025068 belongs to (as far as I can see from this SRA page), where is the reference to 454 in the OP's output coming from?


Good eyes! It was my (and the OP's) mistake. You see, we both used -target sra for our target db in the elink, so the data being fetched was for the identifier 561398 from SRA instead of BioProject. I have now fixed my command to use -target bioproject to get the correct data out.
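
In rentrez terms, the same mix-up is in the original post's entrez_summary call: the IDs in all_the_links$links$sra_bioproject_all are BioProject UIDs, but they were summarised with db="sra". A rough sketch of one way to keep the two in sync follows. Both function names are hypothetical, and the helper assumes link names follow a "dbfrom_dbto" or "dbfrom_dbto_subset" pattern, which holds for "sra_bioproject_all" but which I have not checked for every link name.

```r
# Assumption: link names look like "<dbfrom>_<dbto>" or
# "<dbfrom>_<dbto>_<subset>", e.g. "sra_bioproject_all".
target_db_from_link_name <- function(link_name) {
  parts <- strsplit(link_name, "_", fixed = TRUE)[[1]]
  if (length(parts) < 2) stop("unexpected link name: ", link_name)
  parts[2]
}

# Not called here: needs the rentrez package and network access.
# The linked IDs are summarised against the database they belong to,
# rather than against the database the search started from.
summarise_linked_records <- function(run_acc, link_name = "sra_bioproject_all") {
  hits  <- rentrez::entrez_search(db = "sra", term = run_acc)
  links <- rentrez::entrez_link(dbfrom = "sra", id = hits$ids, db = "all")
  ids   <- links$links[[link_name]]
  rentrez::entrez_summary(db = target_db_from_link_name(link_name), id = ids)
}

target_db_from_link_name("sra_bioproject_all")
```

With that, summarise_linked_records("SRR10025068") would query db="bioproject" for the linked IDs instead of db="sra".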


My (original) issue was that the xml output was not correctly formatted, and I later realized the data returned was not correct.

5.1 years ago
vkkodali_ncbi ★ 3.8k

This appears to be unnecessarily complicated to me. For a given list of SRA accessions, you should be able to just download the comma-separated runinfo table from the command line (without going through R) and then parse the output file as a CSV from within R. Do you need to do everything from within R? If you do need a parsable XML from within R, you can do the following:

> r1 <- entrez_fetch(db='sra', id='SRR10025068', rettype='runinfo', retmode='xml', parsed=TRUE)
> r1
[1] "\n<SraRunInfo>\n<Row>\n<Run>SRR10025068</Run>\n<ReleaseDate>2019-08-27 18:44:15</ReleaseDate>\n<LoadDate>2019-08-27 16:31:34</LoadDate>\n<spots>12479737</spots>\n<bases>3768880574</bases>\n<spots_with_mates>12479737</spots_with_mates>\n<avgLength>302</avgLength>\n<size_MB>1154</size_MB>\n<download_path>https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068</download_path>\n<Experiment>SRX6762113</Experiment>\n<LibraryName>1168003_D6_S</LibraryName>\n<LibraryStrategy>WGS</LibraryStrategy>\n<LibrarySelection>RANDOM</LibrarySelection>\n<LibrarySource>METAGENOMIC</LibrarySource>\n<LibraryLayout>PAIRED</LibraryLayout>\n<InsertSize>0</InsertSize>\n<InsertDev>0</InsertDev>\n<Platform>ILLUMINA</Platform>\n<Model>Illumina NovaSeq 6000</Model>\n<SRAStudy>SRP219390</SRAStudy>\n<BioProject>PRJNA561398</BioProject>\n<ProjectID>561398</ProjectID>\n<Sample>SRS5310604</Sample>\n<BioSample>SAMN12617402</BioSample>\n<SampleType>simple</SampleType>\n<TaxID>408170</TaxID>\n<ScientificName>human gut metagenome</ScientificName>\n<SampleName>30185</SampleName>\n<Tumor>no</Tumor>\n<CenterName>YALE SCHOOL OF PUBLIC HEALTH</CenterName>\n<Submission>SRA948009</Submission>\n<Consent>public</Consent>\n<RunHash>E799243BAFB62132C20AC9F550F70206</RunHash>\n<ReadHash>052064FDF091B79E9DA48242EF5F98A2</ReadHash>\n</Row>\n\n</SraRunInfo>\n"

Nice! ...next time. Now I'm more familiar with rentrez. I was calling this as part of a bigger function on a list of a few hundred SRA accessions, so yes, in this case I needed (wanted) to work from R.


joe : I moved @vkkodali's comment to an answer since it seems to do what you need efficiently. Feel free to accept that (and your own) answer to provide closure to this thread.


I'll just comment that the original issue of the incorrectly formatted XML was not addressed.

5.1 years ago
noodle ▴ 590

Thanks everyone for the responses. In the end I did the below, not exactly what I wanted but I got by...

this.runID <- "SRR10025068"
#
entrez.cmd <- paste0("esearch -db 'sra' -query '", this.runID,"' | esummary  -db 'all' -format runinfo")
entrez.cmd
[1] "esearch -db 'sra' -query 'SRR10025068' | esummary  -db 'all' -format runinfo"
#
entrez.intern <- system(entrez.cmd, intern=TRUE)
entrez.intern
[1] "Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash"
[2] "SRR10025068,2019-08-27 18:44:15,2019-08-27 16:31:34,12479737,3768880574,12479737,302,1154,,https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068,SRX6762113,1168003_D6_S,WGS,RANDOM,METAGENOMIC,PAIRED,0,0,ILLUMINA,Illumina NovaSeq 6000,SRP219390,PRJNA561398,,561398,SRS5310604,SAMN12617402,simple,408170,human gut metagenome,30185,,,,,,,no,,,,,YALE SCHOOL OF PUBLIC HEALTH,SRA948009,,public,E799243BAFB62132C20AC9F550F70206,052064FDF091B79E9DA48242EF5F98A2"                                             
[3] ""
#
entrez.colnames <- unlist(strsplit(entrez.intern[1], ","))
entrez.data <- unlist(strsplit(entrez.intern[2], ","))
this.entrez.data <- t(data.frame(entrez.data))
colnames(this.entrez.data) <- as.character(entrez.colnames)
rownames(this.entrez.data) <- this.runID
this.entrez.data
            Run           ReleaseDate           LoadDate              spots      bases        spots_with_mates avgLength
SRR10025068 "SRR10025068" "2019-08-27 18:44:15" "2019-08-27 16:31:34" "12479737" "3768880574" "12479737"       "302"    
            size_MB AssemblyName download_path                                                               Experiment  
SRR10025068 "1154"  ""           "https://sra-download.ncbi.nlm.nih.gov/traces/sra25/SRR/009790/SRR10025068" "SRX6762113"
            LibraryName    LibraryStrategy LibrarySelection LibrarySource LibraryLayout InsertSize InsertDev Platform  
SRR10025068 "1168003_D6_S" "WGS"           "RANDOM"         "METAGENOMIC" "PAIRED"      "0"        "0"       "ILLUMINA"
            Model                   SRAStudy    BioProject    Study_Pubmed_id ProjectID Sample       BioSample     
SRR10025068 "Illumina NovaSeq 6000" "SRP219390" "PRJNA561398" ""              "561398"  "SRS5310604" "SAMN12617402"
            SampleType TaxID    ScientificName         SampleName g1k_pop_code source g1k_analysis_group Subject_ID Sex
SRR10025068 "simple"   "408170" "human gut metagenome" "30185"    ""           ""     ""                 ""         "" 
            Disease Tumor Affection_Status Analyte_Type Histological_Type Body_Site CenterName                    
SRR10025068 ""      "no"  ""               ""           ""                ""        "YALE SCHOOL OF PUBLIC HEALTH"
            Submission  dbgap_study_accession Consent  RunHash                           
SRR10025068 "SRA948009" ""                    "public" "E799243BAFB62132C20AC9F550F70206"
            ReadHash                          
SRR10025068 "052064FDF091B79E9DA48242EF5F98A2"
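
A possible tidier variant of the same idea, as a sketch only: the function names are hypothetical, the command uses the efetch -format runinfo form shown earlier in the thread, and fetch_runinfo assumes EDirect is installed and on the PATH. It quotes the accession with shQuote() and lets read.csv() do the splitting, since strsplit() drops an empty field at the very end of a line and does not understand quoted commas.

```r
# Build the EDirect pipeline for one run accession. shQuote() guards
# against shell metacharacters in the accession string.
build_runinfo_cmd <- function(run_id) {
  paste("esearch -db sra -query", shQuote(run_id, type = "sh"),
        "| efetch -format runinfo")
}

# Turn the lines returned by system(..., intern = TRUE) into a data frame.
# Blank lines and repeated header lines are dropped before parsing.
parse_runinfo_csv <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]
  if (length(lines) < 2) stop("expected a header line and at least one data line")
  header <- lines[1]
  body   <- lines[-1]
  body   <- body[body != header]
  read.csv(text = paste(c(header, body), collapse = "\n"),
           stringsAsFactors = FALSE)
}

# Not called here: needs EDirect on the PATH and network access.
fetch_runinfo <- function(run_id) {
  parse_runinfo_csv(system(build_runinfo_cmd(run_id), intern = TRUE))
}

# Offline demonstration on a hand-shortened subset of the columns above
demo_lines <- c("Run,AssemblyName,Platform,Model",
                "SRR10025068,,ILLUMINA,Illumina NovaSeq 6000",
                "")
demo <- parse_runinfo_csv(demo_lines)
```

For a list of accessions, something like do.call(rbind, lapply(run_ids, fetch_runinfo)) should give one row per run, provided every call returns the same columns.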