I'm trying to use EDirect to extract BioSample IDs together with their associated BioProject IDs and SRA IDs, and I'm having a problem with the output file.
The XML file I'm using looks like this:
</BioSample><BioSample submission_date="2009-11-25T14:28:04.407" access="public" last_update="2015-01-29T02:30:16.280" publication_date="2009-11-25T14:28:09.680" id="5124" accession="SAMN00005124">
<Ids>
<Id is_primary="1" db="BioSample">SAMN00005124</Id>
<Id db="SRA">SRS007221</Id>
<Id db="GEO">GSM451804</Id>
</Ids>
<Description>
<Title>AdultMale_combined_RNAseq_1, 2</Title>
<Organism taxonomy_id="7227" taxonomy_name="Drosophila melanogaster">
<OrganismName>Drosophila melanogaster</OrganismName>
</Organism>
</Description>
<Owner>
<Name>Institute for Genomics and Systems Biology, University of Chicago</Name>
<Contacts>
<Contact email="kpwhite@uchicago.edu">
<Name>
<First>Kevin</First>
<Last>White</Last>
</Name>
</Contact>
</Contacts>
</Owner>
<Models>
<Model>Generic</Model>
</Models>
<Package display_name="Generic">Generic.1.0</Package>
<Attributes>
<Attribute attribute_name="source_name" display_name="source name" harmonized_name="source_name">AdultMale</Attribute>
<Attribute attribute_name="development stage" display_name="development stage" harmonized_name="dev_stage">AdultMale</Attribute>
</Attributes>
<Links>
<Link label="GEO Sample GSM451804" type="url">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM451804</Link>
<Link label="PRJNA116485" type="entrez" target="bioproject">116485</Link>
<Link label="PRJNA168994" type="entrez" target="bioproject">168994</Link>
<Link label="PRJNA63467" type="entrez" target="bioproject">63467</Link>
</Links>
The EDirect line that I'm using is the following:
xtract -input out001.xml -pattern BioSample \
-block Ids -first Id \
-block Id -if Id@db -equals "SRA" -element Id \
-block Link -if Link@target -equals "bioproject" -tab "," -element Link > xtract2_out.txt
This produces output like the following:
SAMN00014503 SRS074435 128909,129179
SAMN00014655 SRS074533 127185
SAMN00014812 129305,127109
The last line is where the problem is. When a record has no SRA ID, that column is skipped entirely, so the BioProject IDs shift left into the SRA column.
After consulting the documentation, I tried using the -def argument to set a placeholder for missing values, but it didn't work no matter where I put it. I also tried -else together with -lbl, and that didn't work either.
Can anyone help with the syntax?
Thanks!
Any clue on how to include placeholders for the BioProject column as well? I would like to add 4+ columns and have been banging my head against a wall trying to figure out (1) why the -def argument doesn't work and (2) a way around this by using something similar to your code above.
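Not an xtract answer, but in case a workaround outside EDirect is acceptable: below is a minimal Python sketch that builds every column independently and pads any empty one with a placeholder, so the columns can't shift. It uses only the standard library. Several things here are my assumptions and not from the original post: that the records sit inside a single root element (called BioSampleSet here), that the `accession` attribute matches the primary BioSample Id, the function names `biosample_rows` and `to_tsv`, and the "-" placeholder. The second record in the example is made up to mimic the shifted line in the question.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "-"  # assumption: any marker that cannot be mistaken for a real ID


def biosample_rows(xml_text, placeholder=PLACEHOLDER):
    """Yield (BioSample, SRA, BioProjects) per record, padding empty columns."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so it also works if the root is a BioSample
    for sample in root.iter("BioSample"):
        # assumption: the accession attribute equals the primary BioSample Id
        accession = sample.get("accession") or placeholder
        sra_ids = [el.text for el in sample.findall("./Ids/Id[@db='SRA']") if el.text]
        projects = [
            el.text
            for el in sample.findall("./Links/Link[@target='bioproject']")
            if el.text
        ]
        # Each column is built on its own, so a missing value never shifts the rest
        yield (
            accession,
            ",".join(sra_ids) or placeholder,
            ",".join(projects) or placeholder,
        )


def to_tsv(xml_text, placeholder=PLACEHOLDER):
    """Return one tab-separated line per BioSample record."""
    return "\n".join("\t".join(row) for row in biosample_rows(xml_text, placeholder))


# Small illustrative input. The first record reuses IDs from the question;
# the second is a hypothetical record without an SRA Id.
EXAMPLE_XML = """<BioSampleSet>
<BioSample id="5124" accession="SAMN00005124">
<Ids>
<Id is_primary="1" db="BioSample">SAMN00005124</Id>
<Id db="SRA">SRS007221</Id>
<Id db="GEO">GSM451804</Id>
</Ids>
<Links>
<Link label="PRJNA116485" type="entrez" target="bioproject">116485</Link>
<Link label="PRJNA168994" type="entrez" target="bioproject">168994</Link>
<Link label="PRJNA63467" type="entrez" target="bioproject">63467</Link>
</Links>
</BioSample>
<BioSample accession="SAMN00014812">
<Ids>
<Id is_primary="1" db="BioSample">SAMN00014812</Id>
</Ids>
<Links>
<Link type="entrez" target="bioproject">129305</Link>
<Link type="entrez" target="bioproject">127109</Link>
</Links>
</BioSample>
</BioSampleSet>"""

if __name__ == "__main__":
    print(to_tsv(EXAMPLE_XML))
```

On the example above, the second line comes out with "-" in the SRA column and the BioProject IDs stay in the third column. Adding more columns should just mean adding another `findall` with its own `or placeholder` fallback. For a real file you would read its contents and pass them to `to_tsv`, provided the file has a single root element.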