I am running hmmscan against Pfam-A.hmm on a FASTA file containing a proteome. I would like to parse the output to collect a list of the domains identified in my sequences, and to list any novel domain architectures that turn up. Currently I parse the output file with a regex to collect the domains listed above the inclusion threshold, but I am not sure how to go about identifying novel domain architectures. In the end, I'm hoping to produce a few figures showing the percentage of each Pfam-A domain in the proteome, the percentage of sequences with one domain, two domains, three domains, etc., and the percentage of domain architectures that have been seen before versus those that are novel.
Any advice would be greatly appreciated!
Thanks, James
I use the --tblout output instead of parsing the stdout from HMMER.
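In case it helps, here is a minimal sketch of how the --tblout file could be parsed in Python to get the first two summaries from the question. It assumes the standard whitespace-delimited tblout layout (target name in column 1, query name in column 3, full-sequence E-value in column 5, comment lines starting with #). The function names and the 0.01 E-value cutoff are my own choices for illustration, not anything HMMER provides.

```python
from collections import Counter, defaultdict


def parse_tblout(lines, max_evalue=0.01):
    """Map each query sequence to the Pfam families hit in hmmscan --tblout lines.

    max_evalue is an assumed cutoff; adjust it to match your own threshold.
    """
    domains_by_seq = defaultdict(list)
    for line in lines:
        # Skip blank lines and the '#' header/footer comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 18:  # not a complete tblout row
            continue
        # For hmmscan the target is the Pfam model and the query is the protein.
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            domains_by_seq[query].append(target)
    return domains_by_seq


def domain_percentages(domains_by_seq):
    """Percentage of all reported hits that belong to each Pfam family."""
    counts = Counter(d for doms in domains_by_seq.values() for d in doms)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def percent_by_domain_count(domains_by_seq, total_sequences):
    """Percentage of sequences with 0, 1, 2, ... distinct families.

    tblout has one row per family/sequence pair, so repeated copies of the
    same family within one sequence are counted once here.
    """
    counts = Counter(len(doms) for doms in domains_by_seq.values())
    counts[0] = total_sequences - len(domains_by_seq)
    return {n: 100.0 * c / total_sequences for n, c in sorted(counts.items())}
```

To use it, open the tblout file and pass the file handle to parse_tblout, then pass the result plus the number of sequences in your FASTA file to percent_by_domain_count. Note that --tblout has no domain coordinates, so for ordered architectures you would probably want --domtblout, which reports the position of each domain along the sequence.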