Hi everyone,
I am trying to parse a table file I obtained with hmmscan by comparing a set of fastas against the whole Pfam database of protein domains. I am trying to use HmmerIO in Python to parse the ouput, but it seems to me that, for each fasta, it only keeps the domain with the highest evalue, while I want to keep all of those with a significant evalue.
Example:
Input file:
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
GST_N PF02798.17 gi|528981796|ref|NC_021285.1|_5567 - 4.1e-16 58.9 0.0 6.9e-16 58.2 0.0 1.4 1 0 0 1 1 1 1 Glutathione S-transferase, N-terminal domain
GST_N_2 PF13409.3 gi|528981796|ref|NC_021285.1|_5567 - 3.3e-14 52.8 0.0 5.8e-14 52.0 0.0 1.4 1 0 0 1 1 1 1 Glutathione S-transferase, N-terminal domain
GST_N_3 PF13417.3 gi|528981796|ref|NC_021285.1|_5567 - 3.4e-14 52.8 0.0 5.4e-14 52.2 0.0 1.3 1 0 0 1 1 1 1 Glutathione S-transferase, N-terminal domain
GST_C PF00043.22 gi|528981796|ref|NC_021285.1|_5567 - 7e-12 45.3 0.0 9.6e-12 44.8 0.0 1.2 1 0 0 1 1 1 1 Glutathione S-transferase, C-terminal domain
GST_C_2 PF13410.3 gi|528981796|ref|NC_021285.1|_5567 - 6.1e-07 29.2 0.4 9.3e-07 28.6 0.4 1.3 1 0 0 1 1 1 1 Glutathione S-transferase, C-terminal domain
GST_C_3 PF14497.3 gi|528981796|ref|NC_021285.1|_5567 - 0.00011 22.2 0.0 0.00017 21.6 0.0 1.2 1 0 0 1 1 1 1 Glutathione S-transferase, C-terminal domain
Here I found several statistically significant domains for the PF02798.17 gi|528981796|ref|NC_021285.1|_5567 query (fasta sequence), but the SearchIO output I have been able to get looks like this:
for qresult in SearchIO.parse(file1, 'hmmer3-tab'):
print qresult.hits[0]
Query: gi|528981796|ref|NC_021285.1|_5567
<unknown description>
Hit: GST_N
Glutathione S-transferase, N-terminal domain
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 6.9e-16 58.20 ? [None:None] [None:None]
I have no idea of how to access the other hits, even if I am quite sure there has to be a way.
Anyone has any experience with this?
Thanks a lot.
Is there a way I can print it to a file as a tab delimited.
Query, target name, filtered(evalue), qlen ? ##filtered evalue is nothing but i would like output below a specific evalue
Thanks.