I have a list of thousands of FASTA sequences of a protein from various species, with UniProt IDs as headers. I want to assign the sequences to about 50 taxonomic groups and count the occurrences in each group, something like this: http://pfam.sanger.ac.uk/family/PF10413#tabview=tab7 or http://smart.embl.de/smart/do_annotation.pl?DOMAIN=SM00357#annoTable
I am wondering whether the NCBI TaxIDs of Archaea or Bacteria share any common feature that would let me classify them into one group: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide . I can get the species TaxID from UniProt, but I am not sure how to cluster the TaxIDs into a few groups.
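To make the grouping idea concrete, here is a minimal sketch of one possible approach, not a definitive answer: the TaxID number itself carries no group information, but each TaxID has a parent in the NCBI taxonomy, so a species can be assigned to a group by walking up its lineage until a chosen clade is reached. The sketch assumes the child-to-parent map comes from the `nodes.dmp` file of the NCBI taxonomy dump (fields separated by `\t|\t`). The `GROUPS` table, the function names, and the toy node list (a tiny stand-in tree with intermediate ranks omitted) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical choice of groups: NCBI TaxID of each clade -> label.
# Extend this table to whatever ~50 clades are of interest.
GROUPS = {
    2: "Bacteria",
    2157: "Archaea",
    33208: "Metazoa",
    4751: "Fungi",
    33090: "Viridiplantae",
    2759: "Other Eukaryota",
}


def load_parents(lines):
    """Build a child -> parent TaxID map from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        fields = line.rstrip("\t|\n").split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def assign_group(taxid, parents, groups=GROUPS):
    """Walk up the lineage and return the first (most specific) group hit."""
    seen = set()
    while taxid not in seen:
        if taxid in groups:
            return groups[taxid]
        seen.add(taxid)
        # The root is its own parent; unknown TaxIDs also stop here.
        taxid = parents.get(taxid, taxid)
    return "Unclassified"


# Toy stand-in for nodes.dmp (assumption: intermediate ranks are omitted).
TOY_NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
    "2759\t|\t131567\t|\tsuperkingdom\t|\n",
    "33208\t|\t2759\t|\tkingdom\t|\n",
    "4751\t|\t2759\t|\tkingdom\t|\n",
    "562\t|\t2\t|\tspecies\t|\n",      # Escherichia coli
    "9606\t|\t33208\t|\tspecies\t|\n",  # Homo sapiens
    "4932\t|\t4751\t|\tspecies\t|\n",   # Saccharomyces cerevisiae
]

parents = load_parents(TOY_NODES)
species_taxids = [9606, 9606, 562, 4932]  # one TaxID per sequence
counts = Counter(assign_group(t, parents) for t in species_taxids)
print(counts)
```

Because the walk starts at the species and moves toward the root, a narrower clade such as Metazoa is matched before the broader Eukaryota entry, so the broad entry acts as a catch-all for everything else.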
The broader goal is to compare several proteins in this way to find out how they co-evolved. Any suggestions are welcome. Thank you.
Thanks for letting me know about MEGAN. However, I am struggling to map the sequence IDs to TaxIDs to see the distribution of sequences across taxa. I followed the instructions given here: http://ab.inf.uni-tuebingen.de/data/software/megan4/download/welcome.html However, I cannot import BLAST output, either in the following tabular format or in XML format. Nothing gets imported.
sp|P60709|ACTB_HUMAN gi|45269029|gb|AAS55927.1| 100.00 375 0 0 1 375 30 404 0.0 786
sp|P60709|ACTB_HUMAN gi|4501885|ref|NP_001092.1| 100.00 375 0 0 1 375 1 375 0.0 785
sp|P60709|ACTB_HUMAN gi|62897409|dbj|BAD96645.1| 99.73 375 1 0 1 375 1 375 0.0 785
sp|P60709|ACTB_HUMAN gi|54696726|gb|AAV38735.1| 100.00 375 0 0 1 375 1 375 0.0 785
Does it fail because it does not understand the format or the sequence IDs?
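One way to narrow this down is to check the rows independently of MEGAN. The sketch below assumes the output is the standard 12-column BLAST tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that the IDs are pipe-delimited as shown above. It verifies the column count and pulls out the UniProt accession and the GI number, which are the identifiers a seqid-to-TaxID mapping would typically key on. The function name `parse_hit` and the added dictionary keys are hypothetical.

```python
# Column names of the standard 12-column BLAST tabular output.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_hit(line):
    """Parse one tabular BLAST row and extract the embedded identifiers."""
    fields = line.split()  # accepts tabs or spaces
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    hit = dict(zip(COLUMNS, fields))

    # Query IDs look like sp|P60709|ACTB_HUMAN (db | accession | entry name).
    q = hit["qseqid"].split("|")
    hit["uniprot_acc"] = q[1] if q[0] in ("sp", "tr") and len(q) > 1 else None

    # Subject IDs look like gi|45269029|gb|AAS55927.1| (gi | number | db | accession).
    s = hit["sseqid"].split("|")
    hit["gi"] = s[1] if s[0] == "gi" and len(s) > 1 else None
    hit["subject_acc"] = s[3] if len(s) > 3 else None
    return hit


rows = [
    "sp|P60709|ACTB_HUMAN gi|45269029|gb|AAS55927.1| 100.00 375 0 0 1 375 30 404 0.0 786",
    "sp|P60709|ACTB_HUMAN gi|54696726|gb|AAV38735.1| 100.00 375 0 0 1 375 1 375 0.0 785",
]
hits = [parse_hit(r) for r in rows]
for h in hits:
    print(h["uniprot_acc"], h["gi"], h["subject_acc"], h["bitscore"])
```

If every row parses cleanly like this, the layout of the file is probably not the problem, and the ID-to-taxon mapping step becomes the more likely suspect.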