I have InterProScan output for a new genome annotation. I've used Blast2GO to look at the GO terms at different levels, but I now want to produce a summary table of the number of proteins carrying each InterProScan family domain.
Does anyone have a script or method to produce a summary like, for instance, Table 2 of this paper (I'll email the authors to ask whether they have a script and will post it here if I get it): http://www.nature.com/ncomms/2014/140520/ncomms4849/full/ncomms4849.html
I could just collect a subset of InterProScan IDs, grep for them and count the matches, but I'm wondering whether there is a more comprehensive method that summarizes all proteins with family-level InterProScan IDs.
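As a starting point for the counting step, here is a minimal sketch rather than a tested pipeline. It assumes the standard InterProScan TSV output produced with InterPro lookup enabled, where column 1 is the protein accession and column 12 is the InterPro accession. The function name `count_proteins_per_ipr` is made up for illustration. Each protein is counted once per InterPro ID, even if the ID is hit by several member databases.

```python
from collections import defaultdict


def count_proteins_per_ipr(tsv_lines):
    """Return {InterPro ID: number of distinct proteins} from InterProScan TSV lines.

    Assumption: column 1 is the protein accession and column 12 the InterPro
    accession, as in the standard TSV output with InterPro lookup enabled.
    """
    proteins_by_ipr = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # this match has no InterPro annotation columns
        protein_id, ipr_id = fields[0], fields[11]
        if ipr_id.startswith("IPR"):  # skips empty or "-" placeholders
            proteins_by_ipr[ipr_id].add(protein_id)
    return {ipr_id: len(proteins) for ipr_id, proteins in proteins_by_ipr.items()}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_ipr.py annotation.tsv
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            counts = count_proteins_per_ipr(handle)
        for ipr_id, n in sorted(counts.items(), key=lambda item: -item[1]):
            print(f"{ipr_id}\t{n}")
```

Using a set per InterPro ID is what keeps a protein with repeated hits from being counted more than once.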
UPDATE:
I have downloaded the tree relationship file from InterProScan (example given below). Entries prefixed with -- are children of the parent entry above them. For each parent, e.g. IPR015797, I want to sum the number of proteins found including its children, and also report the count for each child separately.
IPR015797::NUDIX hydrolase domain-like::
--IPR000086::NUDIX hydrolase domain::
----IPR020476::NUDIX hydrolase::
--IPR029119::MutY, C-terminal::
IPR015812::Integrin beta subunit::
--IPR012013::Integrin beta-4 subunit::
--IPR015436::Integrin beta-6 subunit::
--IPR015437::Integrin beta-7 subunit::
--IPR015439::Integrin beta-2 subunit::
--IPR015442::Integrin beta-8 subunit::
--IPR027067::Integrin beta-5 subunit::
--IPR027068::Integrin beta-3 subunit::
--IPR027070::Integrin beta-like protein 1::
--IPR027071::Integrin beta-1 subunit::
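One way to roll counts up this tree could look like the sketch below. It assumes the file has exactly the layout shown above (two dashes per nesting level, fields separated by `::`) and that you already have a mapping from InterPro ID to the set of protein IDs annotated with it. The names `parse_tree`, `proteins_in_subtree` and `summarize_tree` are hypothetical. Note that it takes the union of protein sets instead of adding counts, so a protein hitting both a parent and one of its children is counted once in the parent total.

```python
def parse_tree(tree_lines):
    """Parse the parent/child tree into (roots, children, names)."""
    roots, children, names = [], {}, {}
    stack = []  # stack[d] holds the most recent entry seen at depth d
    for line in tree_lines:
        line = line.strip()
        if not line:
            continue
        entry = line.lstrip("-")
        depth = (len(line) - len(entry)) // 2  # assumption: two dashes per level
        ipr_id, name = entry.split("::")[:2]
        children[ipr_id] = []
        names[ipr_id] = name
        del stack[depth:]  # drop entries that are not ancestors of this one
        if depth == 0:
            roots.append(ipr_id)
        else:
            children[stack[-1]].append(ipr_id)
        stack.append(ipr_id)
    return roots, children, names


def proteins_in_subtree(ipr_id, children, proteins_by_ipr):
    """Union of proteins annotated with ipr_id or any of its descendants."""
    found = set(proteins_by_ipr.get(ipr_id, ()))
    for child in children[ipr_id]:
        found |= proteins_in_subtree(child, children, proteins_by_ipr)
    return found


def summarize_tree(tree_lines, proteins_by_ipr):
    """Return {ipr_id: (name, own_count, count_including_descendants)}."""
    _roots, children, names = parse_tree(tree_lines)
    summary = {}
    for ipr_id in children:
        own = len(proteins_by_ipr.get(ipr_id, ()))
        total = len(proteins_in_subtree(ipr_id, children, proteins_by_ipr))
        summary[ipr_id] = (names[ipr_id], own, total)
    return summary
```

The per-entry `own_count` gives the children separately, and `count_including_descendants` gives the parent totals. If you would rather add raw counts, replace the sets with integers, but parent totals may then count some proteins twice.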
Have you had any luck?
Could you possibly tell me where you found the tree relationships of the InterPro IDs? Never mind, found them here.