Question

Alignment of protein sequence to Pfam seed alignment

0

3.5 years ago

jmungar2 ▴ 10

Hello,

I have to align a series of protein sequences to Pfam seed alignments so that I can then calculate the degree of conservation of certain regions of my protein sequences. I am working with Pfam seed alignments rather than full alignments because I have to do this for over a thousand families and some full alignments are too large to handle.

So far I did the following for one of my families:

Download the unaligned seed and the hmm from Pfam
Merge the sequences of my proteins with the sequences of the unaligned seed
Run: hmmalign -o outputfile --trim hmmfile seqfile (the seqfile contains both the sequences of my proteins and the sequences of the unaligned seed)
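For anyone repeating these three steps over many families, here is a minimal Python sketch of the merge and the hmmalign call. The function names and file names are hypothetical placeholders; only the hmmalign command line itself is taken from the question, and the sketch assumes hmmalign is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def merge_fasta(paths, merged_path):
    """Concatenate FASTA files, making sure each one ends with a newline
    so that the last sequence of one file does not run into the next header."""
    with open(merged_path, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            out.write(text if text.endswith("\n") else text + "\n")


def hmmalign_command(hmm_file, seq_file, out_file):
    """Build the same command line used in the question."""
    return ["hmmalign", "-o", str(out_file), "--trim", str(hmm_file), str(seq_file)]


# Usage (all file names are placeholders, not real Pfam downloads):
# merge_fasta(["my_proteins.fasta", "seed_unaligned.fasta"], "merged.fasta")
# subprocess.run(hmmalign_command("family.hmm", "merged.fasta", "aligned.sto"), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when looping over a thousand family files.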

The alignment I get is quite good overall, but I can see some differences (e.g. some gaps appear, some disappear) when I compare it to the ALIGNED SEED from Pfam.

Is the procedure I'm following correct? I'm new to the field and would very much like to have an expert opinion on this.

Thank you, Juan

pfam alignment • 1.1k views

updated 2.9 years ago by BlastedBadger ▴ 160 • written 3.5 years ago by jmungar2 ▴ 10

score 1 · Answer 1 · 2021-06-10

1

3.5 years ago

Mensur Dlakic ★ 28k

Your procedure is correct.

score 0 · Answer 2 · 2022-01-11

Be careful though: hmmalign does not strictly produce a multiple sequence alignment, because the parts that do not match the profile are left unaligned, as stated in the manual:

Important: insertions in a profile HMM are unaligned. Suppose one sequence has an insertion of length 10 and another has an insertion of length 2 in the same place in the profile. The alignment will show ten insert columns, to accommodate the longest insertion. The residues of the shorter insertion are thrown down in an arbitrary order.

The --trim option appears to cut only the unmatched ends of each sequence, not the internal insertions. Check your output files for lowercase residues and for . used as the gap character: both indicate insert columns.
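The check above can be automated. The following is a rough Python sketch, assuming the hmmalign output is in Stockholm format and that insert columns are exactly those containing a lowercase residue or a . in some sequence. The function names are hypothetical, and the parser handles only plain sequence rows, skipping all # annotation lines.

```python
def read_stockholm(text):
    """Collect aligned rows by sequence name, joining interleaved blocks."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue  # skip blank lines, annotation lines and the terminator
        name, chunk = line.split(None, 1)
        rows[name] = rows.get(name, "") + chunk.replace(" ", "")
    return rows


def match_columns(rows):
    """Indices of columns where no row has a lowercase residue or a '.'."""
    seqs = list(rows.values())
    return [
        i for i in range(len(seqs[0]))
        if not any(s[i] == "." or s[i].islower() for s in seqs)
    ]


def strip_inserts(rows):
    """Keep only the profile-aligned (match) columns of every row."""
    keep = match_columns(rows)
    return {name: "".join(seq[i] for i in keep) for name, seq in rows.items()}


# Toy alignment made up for illustration, not real hmmalign output.
toy = """# STOCKHOLM 1.0
seq1  MKVa.LT
seq2  MKVggLT
seq3  M-V..LT
//
"""
rows = read_stockholm(toy)
print(match_columns(rows))
print(strip_inserts(rows))
```

Dropping the insert columns leaves only positions that are aligned to the profile, which are the ones that can be compared across sequences.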

If your goal is to find conserved regions, this should not be much of a problem. However, you cannot, for example, build a phylogenetic tree from such a partial alignment.
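As an illustration of scoring conservation on the profile-aligned columns, here is a small Python sketch. It uses a Shannon-entropy score, which is one common choice and an assumption here: the thread does not say which conservation measure is intended. The function name is hypothetical, gaps are ignored, and the score is normalised by log2(20) for the 20 standard amino acids.

```python
import math
from collections import Counter


def column_conservation(seqs):
    """Per-column score in [0, 1]: 1.0 means a single residue type,
    lower values mean more variability. Gap characters are ignored."""
    scores = []
    for column in zip(*seqs):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:
            scores.append(0.0)  # all-gap column: nothing to score
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / math.log2(20))
    return scores


# Toy input: equal-length aligned rows, made up for illustration.
scores = column_conservation(["AC", "AD", "A-"])
```

Columns with very few non-gap residues give unreliable scores, so in practice it may be worth also recording how many sequences contribute to each column.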