If you download the genome directory from RAST (on the "Job Details" page, select "Genome Directory" as the download format), you will find a file called bindings inside the Subsystems folder. It links each gene's fig|...peg ID to a subsystem and a functional role. You can then match the rows of this file to the FigfamID in the FIG000176 format you describe (see below). Here is the top line of my genome's bindings file:
TCA_Cycle Citrate synthase (si) (EC 2.3.3.1) fig|666666.4954.peg.421
Gordon Pusch at RAST has been amazingly helpful and detailed when I ask questions; here is my exchange with him asking him to describe the subsystem/bindings file:
I have discovered that assigning the
FIG numbers to a subsystem is not
trivial (i.e. FIG133002). Perhaps I
should be using the information in the
Subsystems>bindings file? Can I ask
you to tell me what is contained in
this file?
It has 1443 lines (my genome has 1664
predicted CDS and 984 are included in
subsystems), and the first column is a
description of a kind of gene category
I think (e.g. TCA-cycle,
glutaredoxins...). Many pegs seem to
be repeated a few times (between 2-6,
from a quick grepping around).
Do you know why there are repeats?
Is there a list somewhere of how the
descriptors in the first column of the
bindings file fit into the subsystems,
without having to use the RAST website
and click for every category?
Gordon's answer: RE: the 'Subsystems/bindings' file:
1.) The first column is the subsystem name;
2.) The second column is the PEG's functional role within the subsystem;
3.) The third column is the FIG identifier of the PEG.
The "repeats" occur because subsystems
are allowed to "overlap," i.e., a
given PEG may participate in more than
one subsystem.
The complete table of subsystems and
the functional roles within them may
be downloaded from
<ftp://ftp.nmpdr.org/subsystems/subsys.txt>.
This table contains additional columns
grouping the subsystems into
categories and subcategories.
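Gordon's three-column description can be turned into a small parser. What follows is only a minimal sketch: I am assuming the columns are tab-separated (with a whitespace fallback that relies on subsystem names containing no spaces, as in TCA_Cycle), and the function names are my own. The second example line is invented purely to show a repeat.

```python
from collections import defaultdict


def parse_bindings(lines):
    """Yield (subsystem, role, peg) tuples from bindings-style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "\t" in line:
            # Assumed layout: subsystem <TAB> role <TAB> PEG id
            fields = line.split("\t")
        else:
            # Fallback: first token is the subsystem, last token is the PEG,
            # everything in between is the functional role.
            parts = line.split()
            if len(parts) < 3:
                continue
            fields = [parts[0], " ".join(parts[1:-1]), parts[-1]]
        if len(fields) < 3:
            continue
        yield fields[0], fields[1], fields[2]


def subsystems_by_peg(records):
    """Group subsystem names by PEG, so overlapping PEGs become visible."""
    by_peg = defaultdict(set)
    for subsystem, _role, peg in records:
        by_peg[peg].add(subsystem)
    return by_peg


example = [
    "TCA_Cycle\tCitrate synthase (si) (EC 2.3.3.1)\tfig|666666.4954.peg.421",
    # Invented line, only to illustrate a PEG that appears twice:
    "Hypothetical_Subsystem\tCitrate synthase (si) (EC 2.3.3.1)\tfig|666666.4954.peg.421",
]

records = list(parse_bindings(example))
by_peg = subsystems_by_peg(records)
repeated = {peg: subs for peg, subs in by_peg.items() if len(subs) > 1}
```

For a real file you would pass an open file handle instead of the example list. The `repeated` dictionary then lists every PEG that sits in more than one subsystem.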
Another useful piece of info from Gordon:
I have a (somewhat outdated) webpage describing the contents of a
SEED format genome directory as of two
years ago:
<http://microbe.cs.niu.edu/biodocs/Class_Notes/SEED_overview.html>;
skip down to section "Structure of a
Genome Directory."
You can link the FigfamID with the fig|##.peg.# kind of ID, using a file in the Genome Directory called "found" for the genes that have been included in subsystems. There may be other files with the same info, but here is an excerpt from Subsystems>found:
fig|666666.4954.peg.4 FIG133002 Pyridoxamine 5'-phosphate oxidase (EC 1.4.3.5)
fig|666666.4954.peg.7 FIG000635 MG(2+) CHELATASE FAMILY PROTEIN / ComM-related protein
and "proposed_non_ff_functions" for the genes that were not included in subsystems:
fig|666666.4954.peg.1 Autotransporter adhesin
fig|666666.4954.peg.2 hypothetical protein
fig|666666.4954.peg.3 hypothetical protein
fig|666666.4954.peg.5 putative monooxygenase component
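To go from a FigfamID to a subsystem, the found file and the bindings file can be joined on the fig|...peg ID. Here is a rough sketch of that join. It assumes found has three whitespace-separated leading fields (PEG, FigfamID, function) and that bindings is tab-separated; the function names are mine, and the bindings line in the example is invented because my excerpt above does not show peg.4.

```python
from collections import defaultdict


def load_found(lines):
    """Map PEG id -> (FigfamID, function) from found-style lines."""
    table = {}
    for line in lines:
        # Split off the first two fields; the rest is the function text.
        parts = line.strip().split(None, 2)
        if len(parts) == 3:
            peg, figfam, function = parts
            table[peg] = (figfam, function)
    return table


def load_bindings(lines):
    """Return a list of (subsystem, role, peg), assuming tab-separated columns."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            rows.append((fields[0], fields[1], fields[2]))
    return rows


def figfam_to_subsystems(found, bindings):
    """Join the two tables on the PEG id: FigfamID -> {(subsystem, role)}."""
    result = defaultdict(set)
    for subsystem, role, peg in bindings:
        if peg in found:
            figfam, _function = found[peg]
            result[figfam].add((subsystem, role))
    return dict(result)


found_lines = [
    "fig|666666.4954.peg.4\tFIG133002\tPyridoxamine 5'-phosphate oxidase (EC 1.4.3.5)",
    "fig|666666.4954.peg.7\tFIG000635\tMG(2+) CHELATASE FAMILY PROTEIN / ComM-related protein",
]
bindings_lines = [
    # Invented subsystem name, for illustration only:
    "Hypothetical_Subsystem\tPyridoxamine 5'-phosphate oxidase (EC 1.4.3.5)\tfig|666666.4954.peg.4",
]

found = load_found(found_lines)
links = figfam_to_subsystems(found, load_bindings(bindings_lines))
```

A FigfamID that is missing from `links` simply means none of its PEGs appeared in the bindings lines you supplied.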
I am still a bit stuck on how to handle the fact that there are repeats from the overlapping presence of predicted genes in more than one subsystem, but one good thing if you are comparing multiple genomes annotated in the same way is that they will hopefully have similar overlaps. This may be how I justify comparisons of genomes annotated this way...
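One idea I have been toying with, offered as a sketch and not as an established practice: count each PEG either once in every subsystem it belongs to, or with a fractional weight of 1/n when it belongs to n subsystems, so that the weights add up to the number of PEGs. The function name and the toy data are made up.

```python
from collections import defaultdict


def count_pegs_per_subsystem(pairs, fractional=False):
    """Count PEGs per subsystem from (subsystem, peg) pairs.

    With fractional=True a PEG found in n subsystems adds 1/n to each,
    so the total over all subsystems equals the number of distinct PEGs.
    """
    subsystems_of = defaultdict(set)
    for subsystem, peg in pairs:
        subsystems_of[peg].add(subsystem)

    counts = defaultdict(float)
    for peg, subsystems in subsystems_of.items():
        weight = 1.0 / len(subsystems) if fractional else 1.0
        for subsystem in subsystems:
            counts[subsystem] += weight
    return dict(counts)


# Toy data: peg p1 overlaps subsystems A and B, peg p2 is only in A.
toy_pairs = [("A", "p1"), ("B", "p1"), ("A", "p2")]

full_counts = count_pegs_per_subsystem(toy_pairs)
weighted_counts = count_pegs_per_subsystem(toy_pairs, fractional=True)
```

Whichever convention is used, applying the same one to every genome should at least keep the comparison consistent.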
Did the original person who asked this question (behind the rabbit) find a solution? Any other ideas for how to handle this?
Katrine Whiteson,
University of Geneva Hospitals,
Genomic Research Lab
Phil and Michael,
thanks a lot for your responses. I guess I'll look into parsing the figfams file.
cheers,
tim
Hi Phil!
Could you please send me your code for this mapping task?
Best,
Jhordan Alarcón