Question

Getting the protein Domain architecture from a domtblout

4

Entering edit mode

9.7 years ago

abhijit.synl ▴ 60

Hello Everyone

I have a million protein sequences that I ran against Pfam and now I have the corresponding domtblout files. I would like to get the summary architectures for each of these proteins. Someone has written a script in Bioperl, but I cannot get it to work. Is there a non-bioperl method of extracting these architectures from domtblout. I can use evalue or coverage as cut-off to resolve the domain overlaps.

Thanks

sequence alignment • 5.0k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by abhijit.synl ▴ 60

0

Entering edit mode

What kind of an output you would like to get from this summary?

ADD REPLY • link 9.7 years ago by Andrzej Zielezinski 11k

Ram · Answer 1 · 2015-03-26

Make sure you have installed Python.

Save the following code in the parse_domtblout.py file.

import argparse
from itertools import groupby
import re

usage = """%(prog)s reads .domtblout file and returns non-overlapped (or
specified percent of overlapping) domains below given e-value.
"""
p = argparse.ArgumentParser(description=usage)
p.add_argument("-f", "--f", "-file", "--file", dest="file",
                  help=".domtblout file")
p.add_argument("-e", "--e", "--evalue", "-evalue", type=float, dest="evalue",
                   help="E-value cut-off", default=1e-05)
p.add_argument("-o", "--o", "--overlap", "-overlap", type=int, dest="overlap",
             help="overlap cut-off (percent)", default=40)

args = p.parse_args()

fh = open(args.file)
oh = open(args.file + '.out', 'w')

grouper = lambda x: re.split('\s+', x)[3] if not x.startswith('#') else x[0]
# Iterate through proteins.
for k, g in groupby(fh, grouper):
    if k.startswith('#'): continue  # Skip header and footer of a domtblout file.
    l = []
    for line in g:
        sl = re.split('\s+', line)
        pid = sl[3]
        dname = sl[0]
        did = sl[1]
        dstart = int(sl[17])
        dend = int(sl[18])
        evalue = float(sl[12])
        if evalue <= args.evalue:
             l.append([dname, did, dstart, dend, evalue])

    # Filter domains by a given E-value and overlapping cut-off.
    filtered = []
    l.sort(key=lambda x: x[2])
    for I in range(0, len(l)):
        if i:
            dom1len = l[i][3] - l[i][2]
            dom2len = l[i-1][3] - l[i-1][2]
            domlen = sorted([dom1len, dom2len])[0]
            overlap = l[i-1][3] - l[i][2]
            if overlap/domlen*100 >= args.overlap:            
                continue
        filtered.append(l[i])

    # Save domains to a file.
    filtered.sort(key=lambda x: x[2])
    for d in filtered:
        oh.write('{0}\t{1}\t{2}\t{3}\t{4}\n'.format(pid, *d))

fh.close()
oh.close()

Run the script from the command line as follows:

python parse_domtblout.py --file yourfile.domtblout --evalue 1e-05 --overlap 40

The tab-delimited output containing filtered domains will be saved in yourtfile.domtblout.out. It will look like this:

Protein_ID<tab>Domain_Name<tab>Domain_ID<tab>Coordinates

I recommend to use E-value threshold of 1e-05, as it is a standard cut-off in the Pfam database.

Ram · Answer 2 · 2015-03-26

Assuming you have used --acc flag (if not again do hmmscan with --acc flag), the 7th column is E - value, first remove all the families that are most insignificant ones with a cut off.

$ awk '$7 < cut off value {print $2"\t"$4"\t"$1"\t"..}' yourFile.domtblt

You can include whatever the columns you want in the print section with a tab ("\t").

After that converge the problem even more based on the result

Ram · Answer 3 · 2015-03-26

0

Entering edit mode

9.7 years ago

abhijit.synl ▴ 60

I'd like a tab-delimited output, something like

Protein_ID<tab>Domain_Name<tab>Domain_ID<tab>Coordinates (for example 10..35)

Other variations of output are also fine. Because I can write scripts to parse that out.

I'd like to use the best e-val for domain presence/absence decisions and a 40% cut-off for domain conflict resolution.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by abhijit.synl ▴ 60