Help extracting the taxonomy for a file containing a list of accession numbers from Genbank
2
1
Entering edit mode
8.7 years ago
Longshotx ▴ 70

The title explains it all. I want to create a taxonomy file as described in the QIIME file types overview but I have over 400 sequences and each one of them needs the associated taxonomic information. I could do it manually but figured it might take a very long time. Can someone please help?

sequence gene taxonomy parsing • 4.5k views
ADD COMMENT
0
Entering edit mode

what type of taxonomic information are you looking for, NCBI? You could use blast perhaps as Peter Cock describes it here:

http://blastedbio.blogspot.com/2012/05/blast-tabular-missing-descriptions.html

(see taxonomy info towards the bottom)

ADD REPLY
0
Entering edit mode

Basically I an accession list for all the sequences I am interested in. They are all bacterial species. I need to be able to add the taxonomic information for each accession number.

It would look this:

EF175100.1 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosospira EF175099.1 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosospira EF175098.1 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosospira EF175097.1 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosospira AF509003.1 Bacteria;Proteobacteria;Gammaproteobacteria;Chromatiales;Chromatiaceae;Nitrosococcus

Very interesting! I will have to check this out.

ADD REPLY
1
Entering edit mode
8.7 years ago
5heikki 11k

If I recall correctly, QIIME expects taxonomy map files in "ID<tab>Lineage" format. So for example if I have a file called GILIST, and have Entrez Direct in $PATH

cat GILIST 
807531833
214010441

I could do for example do:

for next in $(cat GILIST); do \
LINEAGE=$(efetch -db nuccore -id $next -format gpc | xtract.pl -element INSDSeq_taxonomy); \
echo -e "$next\t$LINEAGE"; \
done

And output would be:

807531833   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Piciformes; Picidae; Campethera
214010441   Eukaryota; Metazoa; Cnidaria; Anthozoa; Hexacorallia; Scleractinia; Astrocoeniina; Pocilloporidae; Pocillopora
ADD COMMENT
0
Entering edit mode

How would I perform this in terminal on say a Mac? Do I need to install a package?

Thanks!! And that's exactly what I need.

EDIT: Nevermind I think I figured it out!!

ADD REPLY
0
Entering edit mode

5heikki I was able to install Entrez successfully and run your first command. However I input the following command in terminal:

for next in $(cat sequence.gi.txt); do \
LINEAGE=$(efetch -db nuccore -id $next -format gpc | xtract.pl -element INSDSeq_taxonomy); \
echo -e "$next\t$LINEAGE"; \
done

and terminal returned ">" with no output:

I'm not sure what to do at this point. Please help!

ADD REPLY
1
Entering edit mode

Make sure you enter it correctly, you can also put it into a separate file script.sh, it is easier to deal with it like that

for next in $(cat sequence.gi.txt); 
do
    LINEAGE=$(efetch -db nuccore -id $next -format gpc | xtract.pl -element INSDSeq_taxonomy); 
    echo -e "$next\t$LINEAGE";
done

then run it as

bash script.sh
ADD REPLY
0
Entering edit mode

I feel like I am really close but I got the error:

script.sh: line 3: efetch: command not found script.sh: line 3
xtract.pl: command not found

cat command works fine because it pulls up all the Genbank IDs. Any thoughts on this? Also thanks again.

Edit: I re-installed Entrez direct and it worked beautifully!!!

One last question thought is there any way to speed this process up?

ADD REPLY
0
Entering edit mode

It's possible to post up to 500 (I think?) IDs with epost and then pipe to efetch. Be aware though that input and output order may not match. However, utilizing the UNIX tool sort on input could maybe fix this. Otherwise, it might be possible to also parse the ID from the output simultaneously with lineage. The script would be a lot more complicated and utilize many more UNIX tools like like e.g. split (to split your input file into chunks of 500 entries), sed (to format the 500 lines to a comma separated list for epost) and possibly paste (to join input and output for the map file format).

ADD REPLY
0
Entering edit mode

Istvan that worked great. I am now trying to save the output as a .txt file.

I have tried ammending the script with the following line in text wrangler:

./myscript.sh > testfile.txt

But it doesn't work. What am I doing wrong? Thanks

ADD REPLY
1
Entering edit mode

you don't need to edit this script to put the output into a different file, just run it as

bash script.sh > output.txt
ADD REPLY
0
Entering edit mode

Istvan thank you so much. I think its really time I learned how to script. Could you recommend a guide?

ADD REPLY
0
Entering edit mode
7.7 years ago
mavino ▴ 10

Hi All, I am new to Bash and Entrez,

this my Bash script:

'for next in $(tax_reportexample18.csv); do LINEAGE=$(efetch -db nuccore -id $next -format gpc | xtract -element INSDSeq_taxonomy); echo -e "$next\t$LINEAGE"; done'

and csv file looks like:

'9995

9986'

Then I am running bash script.sh > output.txt and I keep getting this message in the command line:

'ERROR: No -pattern in command-line arguments 9995

ERROR: No -pattern in command-line arguments 9986'

and my output file looks like the input file.

Any Idea? Thank you very much

ps I had changed in the script 'xtract.pl' with 'xtract' because it was requested by Entrez

ADD COMMENT
0
Entering edit mode

Ok I figured it out, I needed to add a pattern argument in the script and change some parameters. Thus, my script now is: 'for next in $(cat /home/marianoavino/Desktop/tax_reportexample18.csv); do LINEAGE=$(efetch -db taxonomy -id $next -format xml | xtract -pattern Taxon -element TaxId ScientificName); echo -e "$next\t$LINEAGE"; done'

and it works.

ADD REPLY

Login before adding your answer.

Traffic: 2564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6