Question

How to get a summary for accession numbers not starting with 'WP_'?

0


4.7 years ago

6schulte ▴ 30

Hi Biostars community,

I want to use epost and esummary (from NCBI's E-utilities) to obtain lineage information.

However, I am having problems with accession numbers that do not start with WP_.

While

cat "$ListWithAccessionNumbers" | epost -db protein |\
    esummary -db taxonomy -format xml | \
    xtract  -pattern Seq-entry -element Org-ref_taxname, OrgName_lineage, NCBIeaa, Textseq-id_accession \
    > SummaryTable.tsv

does give me a TSV file, some cells are not filled with the requested information.

For accession numbers that do not start with WP_, the accession number and sequence are not printed; they are printed only for accession numbers starting with WP_.

So my question is: how can I obtain lineage information with epost and esummary for accession numbers that do not start with WP_, while still getting the accession number and sequence printed? Does anyone have experience with this?

If you have suggestions on how to use epost and esummary differently, or know of another way to use NCBI's E-utilities to solve this problem, I would be grateful for your ideas and help. Thank you!

esummary epost lineage summary E-utility • 1.5k views

ADD COMMENT • link updated 4.7 years ago by Ram 45k • written 4.7 years ago by 6schulte ▴ 30

1


Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLY • link 4.7 years ago by Ram 45k

0


Thank you! I didn't know that, very helpful :)

ADD REPLY • link 4.7 years ago by 6schulte ▴ 30

score 3 · Accepted Answer · 2020-11-18

3


4.7 years ago

GenoMax 152k

So my current question is how can I obtain lineage information for those accession numbers using epost and esummary that do not start with WP_ but still also get the accession number and sequence printed out?

Once an input is passed to Entrez Direct, there is no way to keep track of it, so you should print it before it gets passed to Entrez Direct.

Edit: Taking out code which was not tested. Use @vkkodali's solution which is complete.

If that information is missing from the records for specific accessions, you will not be able to retrieve it.

ADD COMMENT • link 4.7 years ago by GenoMax 152k

2


Just noting that you are not parsing taxonomy XML. The output piped from epost already refers to the protein database, so esummary ignores -db taxonomy in this instance. If you want to search across databases, you will need to use elink. If all you need is taxonomy information (and not the sequence), you can do something like:

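for i in $(cat file_w_accession_one_per_line); do
    # Look up the linked taxonomy record; the result is empty if nothing is found.
    tax=$(epost -db protein -id "$i" -format acc \
        | elink -target taxonomy \
        | efetch -format xml \
        | xtract -pattern TaxaSet -first ScientificName -element Lineage)
    # Print the accession first so every row is labelled and newline-terminated.
    printf '%s\t%s\n' "$i" "$tax"
done > SummaryTable.tsv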

If, on the other hand, you want the sequence as well, you can simplify it to:

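for i in $(cat file_w_accession_one_per_line); do
    # Fetch the protein record itself; the result is empty if nothing matches.
    rec=$(efetch -db protein -id "$i" -format xml \
        | xtract -pattern Seq-entry -element Org-ref_taxname, OrgName_lineage, NCBIeaa, Textseq-id_accession)
    # Print the accession first so every row is labelled and newline-terminated.
    printf '%s\t%s\n' "$i" "$rec"
done > SummaryTable.tsv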
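
To find out which accessions came back with empty cells, a quick check on the resulting table may help. This is only a minimal sketch: it assumes the table is tab-separated with one accession per row in the first column, and the function name `list_incomplete_rows` and the default of three expected fields are illustrative assumptions, not part of Entrez Direct.

```shell
#!/bin/sh
# list_incomplete_rows FILE [MIN_FIELDS]
# Hypothetical helper: prints the first column (the accession) of every row
# that has fewer than MIN_FIELDS non-empty tab-separated fields (default 3).
list_incomplete_rows() {
    awk -F '\t' -v min="${2:-3}" '
        {
            filled = 0
            # Count only the fields that actually contain something.
            for (k = 1; k <= NF; k++) if ($k != "") filled++
            if (filled < min) print $1
        }
    ' "$1"
}
```

Something like `list_incomplete_rows SummaryTable.tsv 3` would then list the accessions that need a second look.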

ADD REPLY • link 4.7 years ago by vkkodali_ncbi ★ 3.8k

0


Thank you for your reply, genomax. You have already helped me with my last question. I am sorry for not giving examples of the accession numbers I am working with: WP_112675856, CCH19814, WP_149273991 and RSN10899 would be examples.

And thank you very much as well, vkkodali! The suggested code works pretty well! :) Unfortunately, the code suggested for also retrieving the sequence does not print the sequence for accession numbers like CCH19814 and RSN10899, only for ones like WP_112675856 and WP_149273991.

Since I can extract the sequence with a simple esl-sfetch command and do not necessarily need it right now, I will just work with the code you suggested as the first option. Thank you!
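
Since only the accessions without the WP_ prefix are missing a sequence, the list could be split by prefix first so that the two groups can be handled separately. This is a rough sketch, assuming one accession per line; the function name `split_by_prefix` and the output file names are made up for illustration.

```shell
#!/bin/sh
# split_by_prefix LIST WP_OUT OTHER_OUT
# Hypothetical helper: writes accessions starting with WP_ to WP_OUT and
# all remaining accessions to OTHER_OUT.
split_by_prefix() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # being treated as a failure.
    grep '^WP_' "$1" > "$2" || true
    grep -v '^WP_' "$1" > "$3" || true
}
```

For example, `split_by_prefix accessions.txt wp.txt other.txt` would leave the accessions that still need their sequence fetched another way in `other.txt`.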

ADD REPLY • link 4.7 years ago by 6schulte ▴ 30