Question

How to use HMMBUILD

1

Entering edit mode

10.1 years ago

h.botond ▴ 50

Hello everybody!

I have a little problem with the hmmer.

First of all I downloaded a database which contains multiple protein alignments (eggnog) in a lot of separate multi fasta files. Now I want to use the hmmer's hmmbuild tool to make HMM profile from these alignments and in the end I want to compere my own protein dataset to these profiles with the hmmschen tool.

Now my problem is that I don't know how can I convert all of these fasta files to HMM profiles in on step. I can do separately but because I have a lot of fasta files this approach is impossible.

Have somebody got a good tip?

Thanks for all helps.

hmmer hmmbuild • 15k views

ADD COMMENT • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

1

Entering edit mode

My problem is not the format of the multiple alignment. As I seen the hmmer 3.1b can handle the fasta file format. And this is the only format type what I can chose.

My problem is that I can not figure out how to convert all separate alignments into separate hmm in one step. Can I write an input list and an output list?

Like

hmmbuild hmmfile_list.hmm alignfile_list.fasta

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

Are your aligned FASTA files one sequence per file? What is the logic based on which the alignment files have been split? We could put them together if we knew how they were split.

EDIT: I checked the eggnog site and downloaded a sample alignment fileset. How they've split the sequences into groups makes no sense to me. Unless we have an MSA with all necessary sequences, creating a profile HMM would not be possible. These look like individual sequences and their homologs and each group is too small to have its own HMM.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Because my bad English it is ambiguous what I am writing. Sorry for that.

I don't want to split the alignments. Not at all. I just want to know how can I make hmm from them in one step. I don't want to go through all the fasta files separately with hmmbuild commend. I want to do this in one step if it is possible.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

I was asking if you knew what connected each file, as in why they were split by eggnog in the first place. Let's start over.

To build a HMM, you need a MSA where all involved sequences have been aligned. Here, you have a bunch of alignments, where each file holds the alignment of 3-4 sequences. You either look at downloading all the sequences (not the alignments, but the sequences themselves) and aligning them yourself.

In summary, 3 alignments of 3 sequences each is not the same as the alignment of 9 sequences.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

This is a little more complicated. The eggnog (Non-supervised Orthologous Groups) database is a really huge database. One alignment contains sometime more than 50 proteins. And there are more than 1000 alignments. I want to make a comparative genomic work. I want to sort may 6000 proteins to orthologous groups with the help of the hmm profiles generated from the eggnog database.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

So in essence, you want one HMM profile per alignment for your next step, in which you compare your sequences against all profiles and find which one suits it best, correct?

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Yes, that is right. That is my plan.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

1

Entering edit mode

UNIX loops, it is! Something like this:

for f in $(ls *.fa)
do
  echo "Processing file: ${f}"
  hmmbuild ${f}_profile.hmm ${f}
done

ADD REPLY • link 3.6 years ago by Ram 44k

1

Entering edit mode

This is great! Thank a lot! It is working!

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

Glad it worked! I've updated my answer with the code so you can accept it :)

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

If it is not a big problem than I have another question.

Do you know an easy way to sort the proteins according they top HMMer model hit. Keep the best hit release the lower hits.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

I haven't used HMMER, but there should ideally be a score or an e-value that you could sort by and pick the top-scoring or lowest e-value match.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Yes, you are right. But at this moment I am a programmer analphabet. I am learning the R but I am in the really beginning. Because of that I thought there is an easy way what I should to know.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

The manual on HMMER should give you an idea of what output param you can sort by. With the Linux command line, sort+filter should be quite easy!

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Sorry again! When I tried to run this pipeline with my full dataset an error message popped up "Argument list too long". What can I do in these case?

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

You have too many fasta files in the directory. You might wanna use patterns that match subsets of your files (in place of *.fa that matches all fasta files). Or, you could create directories and push files into them (1000 files per batch = directories with 1000 files each).

Either way, your target is to ensure that you're processing a manageable number of files.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

And what kind of pattern could you recomand if I have little more than 160,000 fasta files.

ADD REPLY • link updated 3.6 years ago by Ram 44k • written 10.1 years ago by h.botond ▴ 50

0

Entering edit mode

Here's a shell script to get you going. Not the best, but it works.

List all FASTA files and split into batches:
```
cat *.fa | split -l 1000 - FAFILE
```

Move the contents of each FAFILExyz into a new directory:

for f in $(ls FAFILE*)
>do
>mkdir dir_${f}
>cp `cat ${f} | tr "\n" " "` dir_${f}
>done

Run HMMER once per directory.

Note: I can recommend patterns only if I know how the 160K files are named. That might be a better approach.

ADD REPLY • link 5.1 years ago by Ram 44k

score 1 · Answer 1 · 2014-10-28

I'd suggest this broad approach:

Concatenate FASTA files into a single FASTA file then convert it to GCG MSF (concat might be a bit tricky), or go back to source and download alignment in a supported format - that might be easier.

UNIX for loops might help you deal with multiple files at once.

Updating this answer based on out interaction:

You're looking to create one profile per file. That's a simple matter of looping through the FA files and running the hmmbuild command on each. This should work:

for f in $(ls *.fa)
do
echo "Processing file: ${f}"
hmmbuild ${f}_profile.hmm ${f}
done