Question

How do I get all sequences for one specific domain from HMMER vs Pfam output that contains results for many domains?


4.8 years ago

emilyc ▴ 30

Hello,

Is there a simple way to extract all sequences for one specific domain from HMMER vs Pfam results?

I have used esl-reformat to output a FASTA file containing the sequence of each domain found in my input; however, the output doesn't specify which domain each sequence belongs to. Many domains were found in the original multi-FASTA file, but I only want to extract the sequences corresponding to one of them.

Thanks in advance for any advice,

Emily

domain hmmer pfam • 3.0k views

ADD COMMENT • link updated 4.8 years ago by Mensur Dlakic ★ 29k • written 4.8 years ago by emilyc ▴ 30


I'm not sure what you did or how, but if you run your proteins through InterPro you will get, for each input protein, a list of all its matching domains.

From the output file you can then either grep for a protein ID to get all of its domains, or grep for a domain ID to get all proteins that have that specific domain.
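As a rough sketch of that filtering step, assuming the output is InterProScan's tab-separated (TSV) format, where column 1 is the protein ID and column 5 is the signature accession (e.g. PF00001). The function names and the file name in the usage line are made up for illustration:

```shell
# Sketch only: assumes InterProScan TSV output
# (column 1 = protein ID, column 5 = signature accession).

# All proteins that contain a given domain.
proteins_with_domain() {
    # $1 = domain accession, e.g. PF00001
    # $2 = InterProScan TSV file (hypothetical name)
    awk -F '\t' -v dom="$1" '$5 == dom { print $1 }' "$2" | sort -u
}

# All domains found in a given protein.
domains_of_protein() {
    # $1 = protein ID, $2 = InterProScan TSV file (hypothetical name)
    awk -F '\t' -v prot="$1" '$1 == prot { print $5 }' "$2" | sort -u
}
```

For example, `proteins_with_domain PF00001 my_proteins.tsv > ids_with_7tm_1.txt`. Matching on an exact column is a bit safer than a plain grep, which would also match the ID if it happens to appear in a description field.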

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

score 1 · Answer 1 · 2020-06-30

There are at least two ways of doing this. The easier one is to go to Pfam directly and download aligned hits for the domain of interest. Let's say that you want hits for 7tm_1 or PF00001:

https://pfam.xfam.org/family/PF00001

Click on Alignments on the left-hand side, and select the database in the Format an alignment section. Let's say you want all UniProt members (~165K sequences) so click on a button that corresponds to it, select the alignment type and click on Generate to download. This works if you are OK with the database choices they have.

If you want to search a custom database, click on Curation & model and download the raw HMM for your domain, in this case PF00001.hmm. Then run hmmsearch (a local HMMER installation is required):

hmmsearch -E 0.001 --cpu 4 -A out_alignment.stk PF00001.hmm /path/to/your_database.fasta

In this case out_alignment.stk will contain a Stockholm alignment of all the hits, which esl-reformat can convert to other formats as needed. You may want to adjust the E-value threshold and the number of CPUs.
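If the Pfam scan has already been run and a per-domain table was saved with hmmscan's --domtblout option (an assumption, since the question doesn't say which output files were kept), a third possibility is to filter that table for the one domain. A minimal sketch; the function name and file names are placeholders:

```shell
# Sketch only: assumes an hmmscan --domtblout table, where
# column 1 = Pfam domain name, column 2 = Pfam accession with version,
# column 4 = query sequence name, columns 20/21 = envelope start/end.
domain_coords() {
    # $1 = domain name or accession without version (7tm_1 or PF00001)
    # $2 = domtblout file (hypothetical name, e.g. scan.domtblout)
    awk -v dom="$1" '
        /^#/ { next }                               # skip comment lines
        {
            acc = $2
            sub(/\.[0-9]+$/, "", acc)               # PF00001.21 -> PF00001
            if ($1 == dom || acc == dom)
                # fields: new_name from to source_sequence
                print $4 "/" $20 "-" $21, $20, $21, $4
        }
    ' "$2"
}
```

Each output line has the four fields (new name, start, end, source sequence) that esl-sfetch expects in its -C -f mode. So after indexing the FASTA file with `esl-sfetch --index proteins.fasta`, something like `domain_coords PF00001 scan.domtblout | esl-sfetch -Cf proteins.fasta -` should write one FASTA record per domain hit. Envelope coordinates are used here; swap in columns 18 and 19 if you prefer the alignment coordinates.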