Getting the protein Domain architecture from a domtblout
3
4
Entering edit mode
9.7 years ago
abhijit.synl ▴ 60

Hello Everyone

I have a million protein sequences that I ran against Pfam and now I have the corresponding domtblout files. I would like to get the summary architectures for each of these proteins. Someone has written a script in Bioperl, but I cannot get it to work. Is there a non-bioperl method of extracting these architectures from domtblout. I can use evalue or coverage as cut-off to resolve the domain overlaps.

Thanks

sequence alignment • 5.0k views
ADD COMMENT
0
Entering edit mode

What kind of an output you would like to get from this summary?

ADD REPLY
10
Entering edit mode
9.7 years ago

Make sure you have installed Python.

Save the following code in the parse_domtblout.py file.

import argparse
from itertools import groupby
import re

usage = """%(prog)s reads .domtblout file and returns non-overlapped (or
specified percent of overlapping) domains below given e-value.
"""
p = argparse.ArgumentParser(description=usage)
p.add_argument("-f", "--f", "-file", "--file", dest="file",
                  help=".domtblout file")
p.add_argument("-e", "--e", "--evalue", "-evalue", type=float, dest="evalue",
                   help="E-value cut-off", default=1e-05)
p.add_argument("-o", "--o", "--overlap", "-overlap", type=int, dest="overlap",
             help="overlap cut-off (percent)", default=40)

args = p.parse_args()

fh = open(args.file)
oh = open(args.file + '.out', 'w')

grouper = lambda x: re.split('\s+', x)[3] if not x.startswith('#') else x[0]
# Iterate through proteins.
for k, g in groupby(fh, grouper):
    if k.startswith('#'): continue  # Skip header and footer of a domtblout file.
    l = []
    for line in g:
        sl = re.split('\s+', line)
        pid = sl[3]
        dname = sl[0]
        did = sl[1]
        dstart = int(sl[17])
        dend = int(sl[18])
        evalue = float(sl[12])
        if evalue <= args.evalue:
             l.append([dname, did, dstart, dend, evalue])

    # Filter domains by a given E-value and overlapping cut-off.
    filtered = []
    l.sort(key=lambda x: x[2])
    for I in range(0, len(l)):
        if i:
            dom1len = l[i][3] - l[i][2]
            dom2len = l[i-1][3] - l[i-1][2]
            domlen = sorted([dom1len, dom2len])[0]
            overlap = l[i-1][3] - l[i][2]
            if overlap/domlen*100 >= args.overlap:            
                continue
        filtered.append(l[i])

    # Save domains to a file.
    filtered.sort(key=lambda x: x[2])
    for d in filtered:
        oh.write('{0}\t{1}\t{2}\t{3}\t{4}\n'.format(pid, *d))

fh.close()
oh.close()

Run the script from the command line as follows:

python parse_domtblout.py --file yourfile.domtblout --evalue 1e-05 --overlap 40

The tab-delimited output containing filtered domains will be saved in yourtfile.domtblout.out. It will look like this:

Protein_ID<tab>Domain_Name<tab>Domain_ID<tab>Coordinates

I recommend to use E-value threshold of 1e-05, as it is a standard cut-off in the Pfam database.

ADD COMMENT
0
Entering edit mode

Thanks for writing the script Andrzej. It works!! I added the following two lines at the beginning of your code. Hope that's ok.

# Parser for domtblout
# Written By: Andrzej Zielezinski (http://www.staff.amu.edu.pl/~andrzejz/)

From the code it appears that you wrote the script for hmmscan. Can it be adapted for hmmsearch? Secondly why did you use the i-Evalue and not the E-value column? I am comparing the sequence to the domain profiles in Pfam right?

ADD REPLY
0
Entering edit mode

I'm glad I could help. Please feel free to make any modifications in the code. You're right - I wrote this script for hmmscan. But I think it should also work with hmmsearch, since hmmsearch can also write the output in domtblout format. However, if you need any further help, please let me know.

I used i-Evalue column as a threshold because it is used in Pfam web service to filter significant domain hits. I assumed you scanned your sequences against the collection of Pfam domains (Pfam-A and/or Pfam-B). Am I right?

ADD REPLY
1
Entering edit mode
9.7 years ago
venu 7.1k

Assuming you have used --acc flag (if not again do hmmscan with --acc flag), the 7th column is E - value, first remove all the families that are most insignificant ones with a cut off.

$ awk '$7 < cut off value {print $2"\t"$4"\t"$1"\t"..}' yourFile.domtblt

You can include whatever the columns you want in the print section with a tab ("\t").

After that converge the problem even more based on the result

ADD COMMENT
0
Entering edit mode
9.7 years ago
abhijit.synl ▴ 60

I'd like a tab-delimited output, something like

Protein_ID<tab>Domain_Name<tab>Domain_ID<tab>Coordinates (for example 10..35)

Other variations of output are also fine. Because I can write scripts to parse that out.

I'd like to use the best e-val for domain presence/absence decisions and a 40% cut-off for domain conflict resolution.

ADD COMMENT

Login before adding your answer.

Traffic: 1505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6