EDirect xtract syntax
1
0
Entering edit mode
7.0 years ago
pedrorvc ▴ 30

I'm trying to use EDirect to get the BioSample IDs together with the associated BioProject IDs and SRA IDs, and I'm having a problem with the output file.

The XML file I'm using looks like this:

</BioSample><BioSample submission_date="2009-11-25T14:28:04.407" access="public" last_update="2015-01-29T02:30:16.280" publication_date="2009-11-25T14:28:09.680" id="5124" accession="SAMN00005124">
  <Ids>
    <Id is_primary="1" db="BioSample">SAMN00005124</Id>
    <Id db="SRA">SRS007221</Id>
    <Id db="GEO">GSM451804</Id>
  </Ids>
  <Description>
    <Title>AdultMale_combined_RNAseq_1, 2</Title>
    <Organism taxonomy_id="7227" taxonomy_name="Drosophila melanogaster">
      <OrganismName>Drosophila melanogaster</OrganismName>
    </Organism>
  </Description>
  <Owner>
    <Name>Institute for Genomics and Systems Biology, University of Chicago</Name>
    <Contacts>
      <Contact email="kpwhite@uchicago.edu">
        <Name>
          <First>Kevin</First>
          <Last>White</Last>
        </Name>
      </Contact>
    </Contacts>
  </Owner>
  <Models>
    <Model>Generic</Model>
  </Models>
  <Package display_name="Generic">Generic.1.0</Package>
  <Attributes>
    <Attribute attribute_name="source_name" display_name="source name" harmonized_name="source_name">AdultMale</Attribute>
    <Attribute attribute_name="development stage" display_name="development stage" harmonized_name="dev_stage">AdultMale</Attribute>
  </Attributes>
  <Links>
    <Link label="GEO Sample GSM451804" type="url">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM451804</Link>
    <Link label="PRJNA116485" type="entrez" target="bioproject">116485</Link>
    <Link label="PRJNA168994" type="entrez" target="bioproject">168994</Link>
    <Link label="PRJNA63467" type="entrez" target="bioproject">63467</Link>
  </Links>

The EDirect line that I'm using is the following:

xtract -input out001.xml -pattern BioSample \
-block Ids -first Id \
-block Id -if Id@db -equals "SRA" -element Id \
-block Link -if Link@target -equals "bioproject" -tab "," -element Link > xtract2_out.txt

This outputs something like the following:

SAMN00014503    SRS074435   128909,129179
SAMN00014655    SRS074533   127185
SAMN00014812    129305,127109

The last line is where the problem is. When a record has no SRA ID, the BioProject column shifts left into the SRA column.

After consulting the documentation I tried using the -def argument to insert a placeholder for missing values, but it didn't work no matter where I put it. I also tried -else with the -lbl argument, and that didn't work either.

Can anyone help with syntax?

Thanks!

7.0 years ago
pedrorvc ▴ 30

So I got in touch with the NCBI staff and together we came up with this answer:

xtract -input out001.xml -pattern BioSample -SRA "(-)" \
-block Id -if Id@db -equals "SRA" -SRA Id \
-block Ids -first Id -element "&SRA" \
-block Link -if Link@target -equals "bioproject" -tab "," -element Link

The output is now like I wanted it to be.

SAMN00014503    SRS074435   128909,129179
SAMN00014655    SRS074533   127185
SAMN00014812    -   129305,127109
SAMN00031920    -
SAMN00032070    -
SAMN00032222    -
SAMN00032375    -
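
As I understand it, the command works because the SRA variable is given a default of "-" at the -pattern level, is overwritten only when an Id with db="SRA" is found, and is then printed with "&SRA". For anyone who wants to see that default-then-override idea outside xtract, here is a minimal Python sketch using only the standard library. It is not part of EDirect; the function name sra_or_placeholder and the trimmed-down record are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down record with no SRA Id (illustrative, based on the sample above).
RECORD = """<BioSample accession="SAMN00005124">
  <Ids>
    <Id is_primary="1" db="BioSample">SAMN00005124</Id>
    <Id db="GEO">GSM451804</Id>
  </Ids>
</BioSample>"""


def sra_or_placeholder(sample, placeholder="-"):
    """Return the SRA Id of a BioSample element, or a placeholder if none exists."""
    sra = placeholder  # start from the default, similar in spirit to -SRA "(-)"
    for id_node in sample.iter("Id"):
        if id_node.get("db") == "SRA" and id_node.text:
            # Overwrite the default only when a match is found.
            # If several Ids match, the last one wins in this sketch.
            sra = id_node.text.strip()
    return sra


sample = ET.fromstring(RECORD)
print(sra_or_placeholder(sample))  # prints "-" because the record has no SRA Id
```

Because the value always exists by the time it is printed, the column is never skipped.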

Any clue on how to include placeholders for the bioproject column? I would like to add 4+ columns and have been hitting my head against a wall trying to figure out (1) why the -def argument doesn't work and (2) a way around this by using something similar to your code above.
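
I can't say why -def is not behaving here, but if post-processing outside xtract is acceptable, one workaround is to build every column separately and fall back to a placeholder whenever a column is empty. Below is a rough Python sketch of that, standard library only. The helper names (ids_by_db, links_by_target, row), the BioSampleSet wrapper element and the second accession SAMN00000001 are assumptions for the example, and the first column is selected by db="BioSample" instead of taking the first Id as the xtract command does.

```python
import xml.etree.ElementTree as ET


def ids_by_db(sample, db):
    """Collect the text of every <Id> whose db attribute equals `db`."""
    return [n.text.strip() for n in sample.iter("Id") if n.get("db") == db and n.text]


def links_by_target(sample, target):
    """Collect the text of every <Link> whose target attribute equals `target`."""
    return [n.text.strip() for n in sample.iter("Link") if n.get("target") == target and n.text]


def row(sample, placeholder="-"):
    """Build one tab-separated row with a fixed number of columns."""
    columns = [
        ids_by_db(sample, "BioSample"),
        ids_by_db(sample, "SRA"),
        ids_by_db(sample, "GEO"),
        links_by_target(sample, "bioproject"),
    ]
    # Multiple values in a column are comma-joined; empty columns get the placeholder.
    return "\t".join(",".join(values) if values else placeholder for values in columns)


# Two illustrative records: one complete, one with only a BioSample Id (hypothetical).
XML = """<BioSampleSet>
<BioSample accession="SAMN00005124">
  <Ids>
    <Id is_primary="1" db="BioSample">SAMN00005124</Id>
    <Id db="SRA">SRS007221</Id>
    <Id db="GEO">GSM451804</Id>
  </Ids>
  <Links>
    <Link label="PRJNA116485" type="entrez" target="bioproject">116485</Link>
    <Link label="PRJNA168994" type="entrez" target="bioproject">168994</Link>
  </Links>
</BioSample>
<BioSample accession="SAMN00000001">
  <Ids>
    <Id is_primary="1" db="BioSample">SAMN00000001</Id>
  </Ids>
</BioSample>
</BioSampleSet>"""

for sample in ET.fromstring(XML).iter("BioSample"):
    print(row(sample))
```

Adding another column means adding one more entry to the columns list, and every row keeps the same number of fields regardless of which values are missing.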
