Question

Extracting all information about a sample when using xtract from e-utilities

4.4 years ago

An Ignorant Wanderer • 0

I would like to extract all information about each SAMPLE after running the following query (run the query and add a | grep SAMPLE for clarification on what I mean by SAMPLE):

esearch -db sra -query PRJNA514750 | efetch -format xml

I tried the following:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern EXPERIMENT -element SAMPLE

but this returns nothing (PS: SAMPLEs are within an EXPERIMENT tag). I read in the e-utilities guide that -pattern divides the data into rows and -element into columns, so I presume this didn't work because SAMPLE has multiple tags within it. So I then tried:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern SAMPLE -element random_SAMPLE_tag

where random_SAMPLE_tag is any tag within SAMPLE.

Here's a concrete example:

esearch -db sra -query PRJNA514750 | efetch -format xml | xtract -pattern SAMPLE -element TITLE

This works, but I want to get all the information about each SAMPLE, and I do not know beforehand which tags it contains (I found TITLE manually in this case). Since I want this information for quite a few studies, I can't check each one by hand.

ncbi e-utilities • 1.9k views

updated 4.4 years ago by GenoMax 148k • written 4.4 years ago by An Ignorant Wanderer • 0

score 0 · Answer 1 · 2020-08-07

First save the search output into a file:

esearch -db sra -query PRJNA514750 | efetch -format xml > out.xml

That way you don't need to rerun the query. You can view the structure of the file with:

cat out.xml | xtract -outline

it prints:

SAMPLE
  IDENTIFIERS
    PRIMARY_ID
    EXTERNAL_ID
    EXTERNAL_ID
  TITLE
  SAMPLE_NAME
    TAXON_ID
    SCIENTIFIC_NAME
  SAMPLE_LINKS
    SAMPLE_LINK
      XREF_LINK
        DB
        ID
        LABEL
  SAMPLE_ATTRIBUTES
    SAMPLE_ATTRIBUTE
      TAG
      VALUE

You can also view the XML file in a browser to see the actual content of the file.
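If you want to discover the tags under SAMPLE programmatically rather than by eye, here is a minimal sketch that swaps xtract for Python's standard xml.etree.ElementTree. It is not an xtract feature. The SAMPLE_XML string is a made-up, heavily trimmed stand-in for out.xml, the function name sample_tag_paths is hypothetical, and the sketch assumes out.xml is a single well-formed XML document.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed stand-in for out.xml (assumption: the real file
# has the same kind of nesting, just with far more content).
SAMPLE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE>
      <IDENTIFIERS><PRIMARY_ID>DEMO0001</PRIMARY_ID></IDENTIFIERS>
      <TITLE>demo title</TITLE>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>demo strain</VALUE></SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def sample_tag_paths(root):
    """Return the distinct tag paths found under any SAMPLE, in first-seen order."""
    paths = []

    def walk(node, prefix):
        for child in node:
            path = f"{prefix}/{child.tag}"
            if path not in paths:
                paths.append(path)
            walk(child, path)

    # iter() finds SAMPLE at any depth, so the exact parent tag does not matter.
    for sample in root.iter("SAMPLE"):
        walk(sample, "SAMPLE")
    return paths


# For the real file, use: root = ET.parse("out.xml").getroot()
root = ET.fromstring(SAMPLE_XML)
for path in sample_tag_paths(root):
    print(path)
```

Because the paths are collected across every SAMPLE in the file, a tag that appears in only some samples still shows up in the list.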

Now xtract has some crazy constructs; see the seemingly never-ending stream of increasingly complex examples here: https://www.ncbi.nlm.nih.gov/books/NBK179288/

I don't know of a construct that flattens the entire file into text, but as you can imagine, that process is not nearly as simple as one might think. There is usually a lot of redundant information that would be useless if fully flattened. It is typically better to leave the data as XML and work out how to get the fields you need with xtract when you do need them.
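To give a feel for what flattening involves, here is a rough sketch outside xtract, using Python's standard xml.etree.ElementTree. It turns each SAMPLE into one row: leaf tags become columns, and each SAMPLE_ATTRIBUTE TAG/VALUE pair becomes a column named after the TAG text. The XML below is invented for illustration, the helper names add and flatten_sample are hypothetical, and joining repeated tags with ';' is just one possible choice.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for out.xml; the identifiers and values are not real records.
SAMPLE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE>
      <IDENTIFIERS>
        <PRIMARY_ID>DEMO0001</PRIMARY_ID>
        <EXTERNAL_ID>DEMO_A</EXTERNAL_ID>
        <EXTERNAL_ID>DEMO_B</EXTERNAL_ID>
      </IDENTIFIERS>
      <TITLE>first demo sample</TITLE>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>demo strain</VALUE></SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE><TAG>tissue</TAG><VALUE>demo tissue</VALUE></SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
  <EXPERIMENT_PACKAGE>
    <SAMPLE>
      <IDENTIFIERS><PRIMARY_ID>DEMO0002</PRIMARY_ID></IDENTIFIERS>
      <TITLE>second demo sample</TITLE>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>other strain</VALUE></SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def add(row, key, text):
    """Store text under key; repeated keys are joined with ';'."""
    row[key] = f"{row[key]};{text}" if key in row else text


def flatten_sample(sample):
    """Collapse one SAMPLE element into a flat dict of column name -> text."""
    row = {}
    # Pass 1: every leaf element with text, except the generic TAG/VALUE pairs.
    for elem in sample.iter():
        if len(elem) > 0 or elem.tag in ("TAG", "VALUE"):
            continue
        text = (elem.text or "").strip()
        if text:
            add(row, elem.tag, text)
    # Pass 2: each TAG/VALUE pair becomes a column named after the TAG text.
    for attr in sample.iter("SAMPLE_ATTRIBUTE"):
        tag = (attr.findtext("TAG") or "").strip()
        value = (attr.findtext("VALUE") or "").strip()
        if tag:
            add(row, tag, value)
    return row


# For the real file, use: root = ET.parse("out.xml").getroot()
root = ET.fromstring(SAMPLE_XML)
rows = [flatten_sample(s) for s in root.iter("SAMPLE")]

# Samples can carry different attributes, so the header is the union of all keys.
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

print("\t".join(columns))
for row in rows:
    print("\t".join(row.get(col, "") for col in columns))
```

Note what gets lost: the nesting, the XML attributes on the elements, and the distinction between a repeated tag and a single value containing ';'. That loss is the main reason a generic flattener is harder than it looks.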

score 0 · Answer 2 · 2020-08-07

Perhaps this would help. I have truncated information to include only two samples here.

$ esearch -db sra -query PRJNA514750 | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR8435655,2019-01-23 17:34:09,2019-01-11 15:13:41,20246690,1032581190,0,51,455,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR8435655/SRR8435655.1,SRX5243190,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP178555,PRJNA514750,3,514750,SRS4245865,SAMN10734300,simple,6239,Caenorhabditis elegans,GSM3560682,,,,,,,no,,,,,GEO,SRA833758,,public,5581E5CC4A0EFFEDADC3BEAE797E0A38,C7501C5F9F0424FB05F81C48477BE7E4
SRR8435656,2019-01-23 17:34:09,2019-01-11 15:14:13,22222562,1133350662,0,51,498,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8435656/SRR8435656.1,SRX5243191,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP178555,PRJNA514750,3,514750,SRS4245866,SAMN10734299,simple,6239,Caenorhabditis elegans,GSM3560683,,,,,,,no,,,,,GEO,SRA833758,,public,E03956EFF39BF1150F5E08C1303BBA4E,65078A40AA70B0F6E4B5142130FA9586
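Since runinfo is plain CSV with a header line, the sample-related columns can be picked out by name. Below is a minimal sketch in Python using only the standard csv module. The RUNINFO string keeps just six of the columns from the output above to stay short, and the function name sample_columns and its default column selection are hypothetical.

```python
import csv
import io

# Shortened copy of the runinfo output above: six of the original columns only.
RUNINFO = """Run,Experiment,Sample,BioSample,ScientificName,SampleName
SRR8435655,SRX5243190,SRS4245865,SAMN10734300,Caenorhabditis elegans,GSM3560682
SRR8435656,SRX5243191,SRS4245866,SAMN10734299,Caenorhabditis elegans,GSM3560683
"""


def sample_columns(runinfo_text, wanted=("Run", "Sample", "BioSample", "SampleName")):
    """Return one dict per run, restricted to the requested column names."""
    reader = csv.DictReader(io.StringIO(runinfo_text))
    return [{key: row[key] for key in wanted} for row in reader]


for record in sample_columns(RUNINFO):
    print("\t".join(record.values()))
```

To run it on real data, save the efetch output to a file first and pass the file's contents as runinfo_text. Keep in mind that runinfo has one row per run, so a sample with several runs appears more than once.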