Given a protein family sequence alignment from PFAM, I want to get taxonomy information for each of the sequences. For example, for each sequence, I want to know whether it is eukaryotic or prokaryotic. How can I do this in Python, Bash, or another scriptable tool?
I've been inspecting the `database_files` contents, but I'm not sure how to use them. Any suggestions on what I can try?

If you get the taxonomy file from the `database_files` directory, it seems to contain information in this format: the number in the first column is the NCBI taxID, the second column has the name, and the next column has the phylogeny. It is not clear how to relate this back to PFAM. @Mensur may have an idea.
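As a rough sketch of how the three columns described above could be used, the Python below reads such a file and labels each taxID as eukaryote or prokaryote. It rests on assumptions that are not confirmed here: that the file is tab-separated, and that the phylogeny column is a semicolon-separated lineage whose first entry is the top-level domain (`Eukaryota`, `Bacteria`, or `Archaea`). The function names `classify_lineage` and `load_taxonomy` are hypothetical, and the sample rows are made up for illustration rather than copied from the real file.

```python
import io


def classify_lineage(lineage):
    """Label a lineage string as 'eukaryote', 'prokaryote', 'other', or 'unknown'."""
    # Assumption: the lineage is semicolon-separated and starts at the top-level domain.
    parts = [p.strip() for p in lineage.split(";") if p.strip()]
    if not parts:
        return "unknown"
    top = parts[0].lower()
    if top == "eukaryota":
        return "eukaryote"
    if top in ("bacteria", "archaea"):
        return "prokaryote"
    return "other"  # e.g. viruses or unclassified entries


def load_taxonomy(handle):
    """Map taxID -> (name, label) from a tab-separated taxonomy file."""
    table = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not have the three expected columns
        taxid, name, lineage = fields[0], fields[1], fields[2]
        table[taxid] = (name, classify_lineage(lineage))
    return table


# Illustrative rows only; the real file may contain more columns.
sample = io.StringIO(
    "9606\tHomo sapiens\tEukaryota; Metazoa; Chordata;\n"
    "562\tEscherichia coli\tBacteria; Proteobacteria;\n"
)
taxonomy = load_taxonomy(sample)
for taxid, (name, label) in taxonomy.items():
    print(taxid, name, label)
```

With the real file, the `io.StringIO` sample would be replaced by an opened file handle. This only covers the taxID-to-domain step; relating each sequence in the alignment to a taxID still needs a separate sequence-to-taxID mapping, which is the open question above.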