HHpred batch submission

Entering edit mode

7.9 years ago

dhalaa1 ▴ 20

Hi, I was wondering if anyone knew of a way to do a batch submission with HHpred? Currently I am able to paste each sequence individually and that can be pretty time consuming. If anyone has any tips or ideas how to submit the sequences as a batch I would appreciate any input. Thank you

Aziza

HHpred batch fasta sequence • 6.0k views

ADD COMMENT • link updated 7.9 years ago by Joe 22k • written 7.9 years ago by dhalaa1 ▴ 20

Entering edit mode

7.9 years ago

Joe 22k

I run hundreds of sequences using HHpred on the commandline. I have a few scripts for tabulating the output data if you're interested.

I don't know if commandline work is an option for you.

Ultimately a local install is always the best option.

EDIT:

To install, follow their instructions here. The tricky bit is making sure you set the environment variables correctly, but they provide all the instructions you should need.

Once you've done that, download the latest databases (note, they're very big and will take up a lot of space and take a long time to download).

wget robots=off -r --no-parent -nH -nd -np -R *.html,*.txt -A .tgz http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/

This will download all of their databases. If you just want, say, PDB, try: wget http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_14Sep16.tgz

Extract the database somewhere memorable

tar xvzf pdb_70_14Sep16.tgz

You're then ready to actually start searching, so you can do something like what's below. These options are specific to how I use the software though, so read their manual and determine what parameters for output you want.

for file in ./*.faa ; do
        hhsearch -dbstrlen 50 -B 1 -b 1 -p 60 -Z 1 -E 1E-03 -nocons -nopred -nodssp -cpu 10 -i $file -d /path/to/database/pdb70_hhm.ffdata
done

You can replace this with a parallelisation call if you like using something like GNU parallels, but just be aware that you can run multiple sequences, but you can also specify that each sequence be searched against other HMMs with multiple cores, so don't over do it.

This will generate you a result file for each sequence that will look something like:

Query         PAK_01787 PAK_01787 T4-like virus tail tube protein gp19 1997281:1997730 forward MW:16801
Match_columns 149
No_of_seqs    1 out of 1
Neff          1.0
Searched_HMMs 25094
Date          Thu Dec  3 14:08:12 2015
Command       hhsearch -cpu 10 -i /home/wms_joe/PVCs/PVC_operons/all/PAK_01787.fsa -d /home/wms_joe/Applications/HHSuite/databases/pdb70/pdb70_hhm.ffdata -B 10 -Z 10 -E 1E-03

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 1tvs_A Transactivator protein;  33.0      11 0.00046   25.9   0.0   14   27-42     38-51  (75)
  2 3jqo_B TRAO protein; helical o  31.3      13 0.00051   25.6   0.0   32   44-77     41-72  (135)
  3 2n01_B VIRB9 protein; T4SS, li  30.4      14 0.00054   24.1   0.0   25   42-66     32-60  (106)
  4 1p65_A Nucleocapsid protein; v  20.6      27  0.0011   23.9   0.0   18   17-34     13-30  (73)
  5 2wj5_A Heat shock protein beta  20.4      28  0.0011   20.3   0.0   31   80-113    13-43  (101)
  6 2ltk_A Mono-cysteine glutaredo  18.4      32  0.0013   19.4   0.0   56   35-90     42-100 (110)
  7 1lrw_B Methanol dehydrogenase   16.1      40  0.0016   23.5   0.0   33  109-141    12-46  (83)
  8 1vyb_A ORF2 contains A reverse  15.6      42  0.0017   20.0   0.0   12  106-117     9-20  (238)
  9 1kaf_A Transcription regulator  14.0      49   0.002   23.3   0.0   27   85-111    14-40  (108)
 10 3but_A Uncharacterized protein  13.0      55  0.0022   21.4   0.0   39   88-126     2-40  (136)

No 1
>1tvs_A Transactivator protein; transcription regulation; NMR {Equine infectious anemia virus} SCOP: j.40.1.1 PDB: 1tvt_A
Probab=32.99  E-value=11  Score=25.86  Aligned_cols=14  Identities=43%  Similarity=0.833  Sum_probs=12.3

Q PAK_01787        27 QMCFQSVSGLDISYDT   42 (149)
Q Consensus        27 qmcfqsvsgldisydt   42 (149)
                      |.||..  ||-|||..
T Consensus        38 qlCFlk--GLGIsYg~   51 (75)
T 1tvs_A           38 QLCFLR--SLGIDYLD   51 (75)
T ss_pred             HHHhcc--CCcccccC
Confidence            789999  99999973
...

At which point you can use/modify my script below, which will turn the output file in to a tab delimited file (I then concatenate all my results together so they can be viewed in spreadsheets etc. The script as it is will only return the best hit (No 1) so if you want more than that you'll have to get creative!

	# -- coding: utf-8 --

	"""
	This script takes the .hhr files output by HHSuite and
	turns the quite verbose file in to a fully tabulated
	version with all the fields separated one, one line per
	file. Thus, the file can be viewed simply in Excel etc.
	It requires the non-standard pandas module.
	"""

	# This program is free software: you can redistribute it and/or
	# modify it under the terms of the GNU General Public License as
	# published by the Free Software Foundation, either version 3 of
	# the License, or (at your option) any later version.

	# This program is distributed in the hope that it will be useful,
	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	# GNU General Public License for more details.

	# The license for HHSuite remains with the authors. Please consult
	# their licensce agreements before using this software, and ensure
	# they are cited for any use. Run the script with -b to print rel-
	# evant references.


	import os
	import sys
	import traceback
	import warnings

	__author__ = "Joe R. J. Healey"
	__version__ = "1.0.0"
	__title__ = "tabulateHHpred"
	__license__ = "GPLv3"
	__author_email__ = "J.R.J.Healey@warwick.ac.uk"

	template = \
	u"""
	---\|------\|------------------------\|----\|-------\|-------\|------\|-----\|----\|---------\|--------------\|
	No Hit Short Desc Prob E-value P-value Score SS Cols Query HMM Template HMM
	"""


	def hhparse(hhresult_file, verbose):
	'''Convert HHpred's text-based output table in to a pandas dataframe'''
	from io import StringIO
	import pandas as pd

	pattern = StringIO(template).readlines()[1]
	colBreaks = [i for i, ch in enumerate(pattern) if ch == '\|']
	widths = [j-i for i, j in zip( ([0]+colBreaks)[:-1], colBreaks ) ]

	hhtable = pd.read_fwf(hhresult_file, skiprows=8, nrows=10, header=0, widths = widths)
	if verbose is True:
	print(hhtable)

	top_hit = str(hhtable.loc[0,'Hit'])[0:4]
	top_hit_full = hhtable.loc[0,'Hit']
	top_prob = hhtable.loc[0,'Prob']
	top_eval = hhtable.loc[0,'E-value']
	top_pval = hhtable.loc[0,'P-value']
	top_score = hhtable.loc[0,'Score']

	if verbose is True:
	print("Your best hit: (PDB ID \| Probability \| E-Value \| P-Value \| Score)")
	print("\t" + str(top_hit) + "\t" + str(top_prob) + "\t" + str(top_eval) + "\t" + str(top_pval) + "\t" + str(top_score) )

	return top_hit, top_hit_full, top_prob, top_eval, top_pval, top_score

	def getFullDesc(hhresult_file,top_hit_full, verbose):
	with open(hhresult_file, 'r') as hrh:
	for line in hrh:
	if line.startswith('>' + top_hit_full):
	full_desc = line.rstrip('\n')

	if verbose is True:
	print full_desc

	return full_desc

	def getDOI(top_hit):
	"""Query the PDB REST API to get an associated DOI/Publication"""
	import requests
	try:
	query = requests.get("https://www.ebi.ac.uk/pdbe/api/pdb/entry/publications/" + str(top_hit))
	qjson = query.json()
	doi = qjson[top_hit][0]['doi']

	if not doi:
	doi = "No DOI found."

	except KeyError:
	doi = "Key error. ID likely deprecated."

	return doi

	def displayRefs():
	"""Display relevant references"""
	print('''
	- The following publications pertain to HHSuite:
	- Alva V., Nam SZ., Söding J., Lupas AN. (2016)
	The MPI bioinformatics Toolkit as an integrative platform for advanced protein sequence and structure analysis.
	Nucleic Acids Res. pii: gkw348. PMID: 27131380

	- Remmert M., Biegert A., Hauser A., Söding J. (2011) HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment.
	Nat Methods. 9(2):173-5. doi: 10.1038/nmeth.1818. PMID: 22198341

	- Söding J., Biegert A., Lupas AN. (2005) The HHpred interactive server for protein homology detection and structure prediction.
	Nucleic Acids Res 33 (Web Server issue), W244-W248. PMID: 15980461

	- Söding J. (2005) Protein homology detection by HMM-HMM comparison.
	Bioinformatics 21: 951-960. PMID: 15531603

	- Hildebrand A., Remmert A., Biegert A., Söding J. (2009) Fast and accurate automatic structure prediction with HHpred.
	Proteins 77(Suppl 9):128-32. doi: 10.1002/prot.22499. PMID: 19626712

	- Meier A., Söding J. (2015) Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling.
	PLoS Comput Biol. 11(10):e1004343. doi: 10.1371/journal.pcbi.1004343. PMID: 26496371
	''')
	sys.exit(1)

	def parseArgs():
	"""Parse commandline arguments"""

	import argparse
	try:
	parser = argparse.ArgumentParser(description='This script converts HHpreds verbose output in to a full table for subsequent analysis.')
	parser.add_argument(
	'-i',
	'--infile',
	action='store',
	help='The HHpred output file to parse.')
	parser.add_argument(
	'-v',
	'--verbose',
	action='store_true',
	help='Print additional messages to screen. Default behaviour is false, only the result would be printed to screen for piping etc.')
	parser.add_argument(
	'-o',
	'--outfile',
	action='store',
	default='None',
	help='Output file name to store results in. If none provided, the default will be infile.tsv.')
	parser.add_argument(
	'-b',
	'--bibliography',
	action='store_true',
	help='Display references.')
	parser.add_argument(
	'-d',
	'--doi',
	action='store_true',
	help='Try to retrieve relevant publications from PDB RESTful API. Warning, this will significantly slow down the processing of results.')
	args = parser.parse_args()

	except:
	print "An exception occured with argument parsing. Check your provided options."
	traceback.print_exc()

	return parser.parse_args()

	def main():
	"""Make function calls and handle arguments for tabulateHHpred.py"""
	args = parseArgs()

	if args.bibliography is True:
	displayRefs()

	verbose = args.verbose
	hhresult_file = args.infile

	indir = os.path.dirname(os.path.abspath(hhresult_file))

	split = os.path.splitext(args.infile)
	basename = os.path.basename(split[0])

	if args.outfile is 'None':
	outfile = indir + '/' + basename + '.tsv'
	else:
	outfile = args.outfile

	# Main code begins:

	top_hit, top_hit_full, top_prob, top_eval, top_pval, top_score = hhparse(hhresult_file, verbose)
	full_desc = getFullDesc(hhresult_file, top_hit_full, verbose)

	row = (basename + "\t" + str(top_hit) + "\t" + str(top_hit_full) + "\t" + str(top_prob) + "\t" + str(top_eval) + "\t" + str(top_pval) + "\t" + str(top_score) + "\t" + full_desc + "\n")

	if args.doi is True:
	doi = getDOI(top_hit)
	row = (row.rstrip("\n") + "\t" + doi + "\n")

	with open(outfile, 'w') as ofh:
	ofh.write(row)
	print(row)

	elif args.doi is False:
	with open(outfile, 'w') as ofh:
	ofh.write(row)
	print(row)


	if __name__ == '__main__':
	main()

view raw tabulateHHpred.py hosted with ❤ by GitHub

ADD COMMENT • link 7.9 years ago by Joe 22k

Entering edit mode

Yes jrj.healey please! I'm am very knew to this but I knew there had to be someway I just could not figure out how. What is commandline and how do I use it? Thank you

ADD REPLY • link 7.9 years ago by dhalaa1 ▴ 20

Entering edit mode

I've edited my answer with as much information as I can remember (it's been a while since I installed it!)

ADD REPLY • link 7.9 years ago by Joe 22k