The title explains it all. I want to create a taxonomy file as described in the QIIME file types overview but I have over 400 sequences and each one of them needs the associated taxonomic information. I could do it manually but figured it might take a very long time. Can someone please help?
Basically, I have an accession list for all the sequences I am interested in. They are all bacterial species. I need to be able to add the taxonomic information for each accession number.
If I recall correctly, QIIME expects taxonomy map files in "ID<tab>Lineage" format. So, for example, if I have a file called GILIST and Entrez Direct in $PATH:
cat GILIST
807531833
214010441
I could, for example, do:
for next in $(cat GILIST); do \
LINEAGE=$(efetch -db nuccore -id $next -format gpc | xtract.pl -element INSDSeq_taxonomy); \
echo -e "$next\t$LINEAGE"; \
done
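A slightly more defensive variant of the loop above, as a rough sketch only. It assumes a current Entrez Direct install where the tool is called xtract (rather than xtract.pl) and takes a -pattern argument; build_map is a hypothetical function name, and the "NA" placeholder for IDs that return no lineage is my own assumption, not something QIIME asks for. Redirecting efetch's stdin from /dev/null is there so it cannot swallow the rest of the ID list inside the while loop.

```shell
#!/bin/sh
# build_map FILE
# Print "ID<tab>Lineage" for every ID in FILE (one ID per line).
# "build_map" and the "NA" placeholder are hypothetical choices for this sketch.
build_map() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue          # skip blank lines
        # </dev/null keeps efetch from reading the remaining IDs on stdin
        lineage=$(efetch -db nuccore -id "$id" -format gpc </dev/null |
                  xtract -pattern INSDSeq -element INSDSeq_taxonomy)
        # an empty lineage (failed lookup) is written as NA so the row is kept
        printf '%s\t%s\n' "$id" "${lineage:-NA}"
    done < "$1"
}
```

Usage would be something like build_map GILIST > taxonomy_map.txt (the output file name is just an example).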
It's possible to post up to 500 (I think?) IDs with epost and then pipe to efetch. Be aware, though, that input and output order may not match. Sorting the input with the UNIX tool sort might fix this. Otherwise, it might be possible to parse the ID from the output together with the lineage. The script would be a lot more complicated and use more UNIX tools, e.g. split (to split your input file into chunks of 500 entries), sed (to format the 500 lines as a comma-separated list for epost) and possibly paste (to join input and output for the map file format).
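The chunking step described above could be sketched roughly like this. It uses split as suggested, but paste instead of sed to build the comma-separated list; batch_ids is a hypothetical helper name, and the batch size of 500 is only the guess from the post, not a verified limit. The Entrez Direct pipeline in the trailing comment is untested here (it needs network access) and assumes that efetch accepts a comma-separated -id list; note that INSDSeq_primary-accession would give the accession rather than the GI number.

```shell
#!/bin/sh
# batch_ids FILE SIZE
# Print FILE (one ID per line) as comma-separated batches of at most SIZE IDs,
# one batch per output line. "batch_ids" is a hypothetical helper name.
batch_ids() {
    tmpdir=$(mktemp -d) || return 1
    split -l "$2" "$1" "$tmpdir/chunk_"
    for chunk in "$tmpdir"/chunk_*; do
        [ -e "$chunk" ] || continue       # empty input: no chunk files exist
        paste -s -d , "$chunk"            # join the chunk's lines with commas
    done
    rm -rf "$tmpdir"
}

# Possible use with Entrez Direct (needs network access, not run here):
#   batch_ids GILIST 500 | while IFS= read -r ids; do
#       efetch -db nuccore -id "$ids" -format gpc </dev/null |
#           xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_taxonomy
#   done
```

Printing the identifier and the lineage from the same record sidesteps the ordering problem, since each output row carries its own ID.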
OK, I figured it out: I needed to add a -pattern argument to the script and change some parameters.
My script is now:
for next in $(cat /home/marianoavino/Desktop/tax_reportexample18.csv);
do
LINEAGE=$(efetch -db taxonomy -id $next -format xml | xtract -pattern Taxon -element TaxId ScientificName);
echo -e "$next\t$LINEAGE";
done
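One thing to watch: the loop above requests more than one element per record, so each output row will likely have more than the two tab-separated columns of the "ID<tab>Lineage" layout mentioned earlier in the thread. A quick sanity check before handing the file to QIIME might look like the sketch below. check_map is a hypothetical helper name, and the rule it applies (exactly two non-empty tab-separated fields per line) is my reading of that layout, not an official QIIME validator.

```shell
#!/bin/sh
# check_map FILE
# Report every line that is not exactly "ID<tab>Lineage" with both fields
# non-empty. Returns non-zero if any such line is found.
# "check_map" is a hypothetical helper name for this sketch.
check_map() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 non-empty tab-separated fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_map on the saved output prints the offending line numbers, which makes it easy to see whether extra columns need to be merged or dropped.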
What type of taxonomic information are you looking for? NCBI's? You could perhaps use BLAST, as Peter Cock describes here:
http://blastedbio.blogspot.com/2012/05/blast-tabular-missing-descriptions.html
(see taxonomy info towards the bottom)
Basically, I have an accession list for all the sequences I am interested in. They are all bacterial species. I need to be able to add the taxonomic information for each accession number.
It would look like this:
Very interesting! I will have to check this out.