Question

Taxonomic Distribution Of A Protein

1

11.7 years ago

Pappu ★ 2.1k

I have a list of thousands of FASTA sequences of a protein from various species, with UniProt IDs in the headers. I want to assign the sequences to, say, 50 taxonomic groups and count the occurrences in each group, something like this: http://pfam.sanger.ac.uk/family/PF10413#tabview=tab7 or http://smart.embl.de/smart/do_annotation.pl?DOMAIN=SM00357#annoTable

I am wondering whether the NCBI TaxIDs of, for example, Archaea or Bacteria share any common feature that would let me classify them into one group: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide . I am able to get the species TaxID from UniProt, but I am not sure how to cluster the TaxIDs into a few groups.

The broader goal is to compare several proteins in this way to find out how they co-evolved. Any suggestions are welcome. Thank you.

python hmm • 4.3k views

Updated 11.7 years ago by Manu Prestat 4.1k • written 11.7 years ago by Pappu ★ 2.1k

score 1 · Answer 1 · 2013-03-06

1

11.7 years ago

Manu Prestat 4.1k

You may find MEGAN very useful for exploring your data (though I doubt it can make donut plots; maybe the 5th version, currently in testing, can). Its main purpose is taxonomic classification (using the NCBI taxonomy tree by default) of metagenomic data from the output of BLAST-like similarity searches, but it also has an option to simply supply a seqID-to-taxID mapping file. You can then use the very convenient tree browser to collapse whatever groups you want and make "abundance" plots (histograms, among others) from any branch or level of the tree you select. Last but not least, you can do all of that with several conditions at the same time.

0

Thanks for letting me know about MEGAN. However, I am struggling to map the seqIDs to taxIDs to see the distribution of sequences across taxa. I followed the instructions given here: http://ab.inf.uni-tuebingen.de/data/software/megan4/download/welcome.html However, I cannot import BLAST output in either the tabular format shown below or XML format; nothing gets imported.

sp|P60709|ACTB_HUMAN gi|45269029|gb|AAS55927.1| 100.00 375 0 0 1 375 30 404 0.0 786
sp|P60709|ACTB_HUMAN gi|4501885|ref|NP_001092.1| 100.00 375 0 0 1 375 1 375 0.0 785
sp|P60709|ACTB_HUMAN gi|62897409|dbj|BAD96645.1| 99.73 375 1 0 1 375 1 375 0.0 785
sp|P60709|ACTB_HUMAN gi|54696726|gb|AAV38735.1| 100.00 375 0 0 1 375 1 375 0.0 785
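If the import keeps failing, one fallback is to read the hits directly. The sketch below assumes the rows above are standard 12-column BLAST tabular output (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) separated by whitespace. The names `Hit` and `parse_hit` are made up for illustration and are not part of MEGAN or BLAST.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical container for the fields needed for a seqID-to-taxID mapping."""
    query_accession: str
    subject_gi: str
    percent_identity: float
    evalue: float
    bitscore: float


def parse_hit(line):
    """Parse one whitespace-separated row of 12-column BLAST tabular output."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    # Query IDs look like sp|P60709|ACTB_HUMAN: the accession is the second part.
    query_accession = fields[0].split("|")[1]
    # Subject IDs look like gi|54696726|gb|AAV38735.1|: the GI is the second part.
    subject_gi = fields[1].split("|")[1]
    return Hit(
        query_accession,
        subject_gi,
        percent_identity=float(fields[2]),
        evalue=float(fields[10]),
        bitscore=float(fields[11]),
    )


row = "sp|P60709|ACTB_HUMAN gi|54696726|gb|AAV38735.1| 100.00 375 0 0 1 375 1 375 0.0 785"
hit = parse_hit(row)
print(hit.query_accession, hit.subject_gi, hit.percent_identity)
```

The extracted GI numbers can then be looked up for their taxonomy IDs, which is the piece a seqID-to-taxID mapping needs.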

Reply • 11.7 years ago by Pappu ★ 2.1k

0

Does it fail because it does not understand the format or the sequence IDs?

Reply • 11.7 years ago by Manu Prestat 4.1k

score 0 · Answer 2 · 2013-03-06

Counting the number of proteins per species can be easy: for each UniProt protein ID, query UniProt for the species and store the result in a dictionary whose keys are species and whose values are lists of UniProt IDs. At the end, count the number of values for each key.

Python would be fine for that job (see Stack Overflow for examples).

I would also add this UniProt FAQ; look at the bottom for examples in Python, Perl, Ruby and Java.
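The grouping and counting step could look roughly like the sketch below. It assumes the species for each ID has already been retrieved; the small `lookup` dictionary is only a stand-in for the UniProt responses, and the function names are made up for illustration.

```python
from collections import defaultdict


def group_by_species(id_to_species):
    """Return a dict mapping each species to the list of UniProt IDs found for it."""
    groups = defaultdict(list)
    for uniprot_id, species in id_to_species.items():
        groups[species].append(uniprot_id)
    return dict(groups)


def count_per_species(groups):
    """Return (species, count) pairs, most frequent first, ties sorted by name."""
    pairs = [(species, len(ids)) for species, ids in groups.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Stand-in for the species returned by UniProt for each ID (illustrative only).
lookup = {
    "P60709": "Homo sapiens",
    "P60710": "Mus musculus",
    "P63261": "Homo sapiens",
    "Q6UN29": "Geotria australis",
}

groups = group_by_species(lookup)
counts = count_per_species(groups)
for species, n in counts:
    print(f"{species}\t{n}")
```

Replacing the species name with a higher-level taxon (kingdom, phylum, and so on) as the dictionary key would give counts per taxonomic group instead of per species.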

score 0 · Answer 3 · 2013-03-06

For the second part, I would recommend BayesTraits, written by Mark Pagel and Andrew Meade. The first part is trickier and depends on the question you are trying to answer. On the one hand, you can search for homologs of the protein, which is not a simple task (an example for yeast can be found here). On the other hand, you can ask how many proteins containing the same domain occur in each species; you can get this information from Pfam, for example.

score 0 · Answer 4 · 2013-03-06

Here's how to search NCBI for the organism information associated with a protein:

Get GI:

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=Q6UN29

It returns an XML document in which 82238374 is the GI of this protein.

Then, get the summary information of this protein:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=82238374

It says the taxonomy ID is 71168.

Then look up this ID:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=71168

It tells you that the organism name is Geotria australis, and that it is a lamprey.

These steps can easily be automated in Python.
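A minimal sketch of that automation, using only the standard library. It assumes the responses use the classic (version 1.0) XML layout, where esearch returns `IdList/Id` elements and esummary returns `DocSum/Item` elements with a `Name` attribute; the function names are made up for illustration, and error handling and NCBI's request rate limits are left out.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build an E-utilities URL such as .../esearch.fcgi?db=protein&term=Q6UN29."""
    return f"{EUTILS}/{tool}.fcgi?{urllib.parse.urlencode(params)}"


def first_id(esearch_xml):
    """Return the first ID in an esearch response, or None if there is none."""
    node = ET.fromstring(esearch_xml).find("./IdList/Id")
    return node.text if node is not None else None


def docsum_item(esummary_xml, name):
    """Return the text of the named Item in the first DocSum, or None."""
    node = ET.fromstring(esummary_xml).find(f"./DocSum/Item[@Name='{name}']")
    return node.text if node is not None else None


def fetch(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def organism_for(accession):
    """Follow the three steps above: accession -> GI -> TaxId -> organism."""
    gi = first_id(fetch(eutils_url("esearch", db="protein", term=accession)))
    if gi is None:
        raise LookupError(f"no protein record found for {accession}")
    protein_xml = fetch(eutils_url("esummary", db="protein", id=gi))
    taxid = docsum_item(protein_xml, "TaxId")
    taxonomy_xml = fetch(eutils_url("esummary", db="taxonomy", id=taxid))
    # ScientificName and Division are assumed Item names in the taxonomy summary.
    return (
        taxid,
        docsum_item(taxonomy_xml, "ScientificName"),
        docsum_item(taxonomy_xml, "Division"),
    )


if __name__ == "__main__":
    print(organism_for("Q6UN29"))
```

For thousands of IDs it would be kinder to NCBI to send several IDs per request and to pause between requests rather than calling `organism_for` in a tight loop.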