Question

E-value results from hmmsearch are not accurate. [HMMER]

0


7.1 years ago

danzanzu • 0

Good Afternoon,

I am using the HMMER tools to analyze a DNA sequence dataset, referred to as the H3 dataset, which contains DNA sequences from both class 0 and class 1.

The following is what I did:

I acquired the 7000 DNA sequences of the H3 dataset that are Class 0 and built an HMM profile from 80% of them, i.e. 5600 sequences.
Next, I took 20% of the DNA sequences from both Class 0 and Class 1.
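A minimal sketch of such an 80/20 split, assuming the sequences are in FASTA format and that the 20% is meant to be the held-out remainder of the same class. The function names `parse_fasta` and `split_records` and the fixed seed are hypothetical, not part of HMMER:

```python
import random


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then cut into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 7000 records and the default fraction this gives 5600 training and 1400 test records, and no record ends up in both sets.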
I then ran hmmsearch with the previously built HMM profile against both 20% sets. I expected the search against the Class 0 sequences to report many sequences above the inclusion threshold, with many E-values close to 0.

Resulting in the below output file.

How come no targets have been detected? I thought I had done something wrong up to this point, so I performed one last experiment.

I ran hmmsearch with the HMM profile against the very sequences that were used to build it. There must be matches, since they are exactly the same sequences. The resulting output file is below:

Once again, no hits were detected.

So my final question is: am I using hmmbuild and hmmsearch correctly, and how can I improve the results? It is extremely strange that I am comparing the exact same sequences and getting no hits. Any help would be appreciated.

Thanks.

sequencing sequence • 4.6k views

ADD COMMENT • link updated 7.1 years ago by Shyam ▴ 150 • written 7.1 years ago by danzanzu • 0

2


In the output of hmmbuild, "eff_nseq" is too high and "re/pos" is too low. I think that is because your input multi-FASTA file is not aligned, so the resulting HMM is nonsense and does not hit any sequence.
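One rough way to see whether an input "alignment" carries any column-wise signal is to compute the average information content per column. The sketch below is only a simplified proxy under stated assumptions: it is not how hmmbuild computes "re/pos" (which also involves priors, a background model, and sequence weighting), and the names `column_information` and `mean_information` are hypothetical. For DNA, a perfectly conserved column scores 2 bits and a column with uniformly mixed bases scores 0 bits, so unaligned sequences of equal length would be expected to score near 0.

```python
import math
from collections import Counter


def column_information(column, alphabet="ACGT"):
    """Information content (bits) of one alignment column, ignoring gaps."""
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


def mean_information(aligned_seqs):
    """Average per-column information content of equal-length sequences."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length, so this is not an alignment")
    infos = [column_information("".join(col)) for col in zip(*aligned_seqs)]
    if not infos:
        return 0.0
    return sum(infos) / len(infos)
```

Note that sequences of identical length pass the length check even when they were never aligned, which is why the per-column score is the more telling number here.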

ADD REPLY • link 7.1 years ago by fishgolden ▴ 520

1


I agree with that. Looks like the input is a random alignment. The title of the question is misleading.

ADD REPLY • link 7.1 years ago by cryptogenomicon ▴ 160

score 3 · Answer 1 · 2017-10-19

3


7.1 years ago

Shyam ▴ 150

To build an HMM from sequences, you need to make a multiple sequence alignment first and use that alignment, in FASTA format, as input for hmmbuild. If you didn't align the sequences, there is no way the resulting profile can detect any homology when you search it against the input sequences. Hope this answers your question.
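As a rough sketch of that order of operations (align, then build, then search), the helper below only assembles the command lines. The file names, the `prefix` parameter, and the function names are hypothetical. It assumes MUSCLE v3 flag syntax (`-in`/`-out`; MUSCLE v5 uses `-align`/`-output` instead) and uses nhmmer, HMMER's DNA search program, for the search step.

```python
import shutil
import subprocess


def build_pipeline(train_fasta, test_fasta, prefix="h3_class0"):
    """Return the align -> build -> search commands as argv lists."""
    aligned = f"{prefix}.afa"   # aligned FASTA written by MUSCLE
    hmm = f"{prefix}.hmm"       # profile written by hmmbuild
    hits = f"{prefix}.tbl"      # tabular hit list written by nhmmer
    return [
        # Assumption: MUSCLE v3 command-line syntax.
        ["muscle", "-in", train_fasta, "-out", aligned],
        # --dna states the alphabet explicitly instead of letting hmmbuild guess.
        ["hmmbuild", "--dna", hmm, aligned],
        # Query profile first, target sequences second.
        ["nhmmer", "--tblout", hits, hmm, test_fasta],
    ]


def run_pipeline(commands):
    """Run each command in order; stop early if a tool is not installed."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            print("not installed, skipping:", " ".join(argv))
            return False
        subprocess.run(argv, check=True)
    return True
```

Calling `run_pipeline(build_pipeline("train.fa", "test.fa"))` would run the three steps in sequence; the point of the sketch is that hmmbuild receives the MUSCLE output, not the raw FASTA file.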

ADD COMMENT • link 7.1 years ago by Shyam ▴ 150

0


I thought about what you said. So I downloaded the MUSCLE multiple sequence alignment tool to align the sequences, so that its output could be fed to hmmbuild to form a meaningful HMM profile.

This is the aligned output from the MUSCLE program.

I then fed this aligned output to HMMER to build the profile in the following manner:

However, even after aligning the sequences, the NSEQ is still too high and the re/pos is still too small, resulting in no hits.

ADD REPLY • link 7.1 years ago by danzanzu • 0

1


Are you sure that the nucleotide entries in your FASTA file are from the same family, superfamily, or fold that you want to profile? I googled the names of the entries and found that they have various descriptions, so I think they must belong to different families (SUL1 and SUL2 might belong to the same one). Building a profile from entries that have no evolutionary relationship is also nonsense (most of the time).

Alternatively, the sequences may have evolutionary relationships but be too diverged, in which case HMM construction will also fail. But in that case, eff_nseq would be lower.

(And when you use DNA, be careful about the direction of the strands.)
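To illustrate the strand issue: a motif on the opposite strand appears as its reverse complement, so a plain left-to-right comparison will miss it. A minimal sketch follows (the function name is hypothetical; characters other than A, C, G, T are left unchanged):

```python
# Translation table for complementing bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`.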

ADD REPLY • link 7.1 years ago by fishgolden ▴ 520

0


The nucleotide entries were taken from a research paper and can be found here, so that you can get a better understanding of the data: http://www.jaist.ac.jp/~tran/nucleosome/members.htm

In the above example, I took a random 80% of the H3 negative-class DNA sequences and built a multiple sequence alignment file using MUSCLE; the hmmbuild run is shown above.

I am still stuck trying to build a good HMM profile, since I am a bit of a beginner.

Any suggestions where to continue?

ADD REPLY • link 7.1 years ago by danzanzu • 0

1


You are using the "negative" dataset? Does that mean your hypothesis is that there are histone-avoiding motifs in the dataset, and that you were going to model a histone-avoiding motif with an HMM? That is very interesting.

I'm not a nucleotide person and have not used HMMER or MUSCLE much, so the following comments are based on my limited knowledge, but they are very general, I think.

(Somebody please correct me if I am wrong.)

Problems:

The dataset you are using contains sequences (entries) that do not have evolutionary relationships (because it is the result of ChIP-chip).
ChIP-chip data contains not only histone-binding motifs but also unrelated regions around them.

As I mentioned in my previous comment, an HMM built from such unrelated sequences is nonsense. There is an exception, though: even when sequences do not have evolutionary relationships, you can sometimes build an HMM if they share a strong pattern, such as a signal peptide or a transmembrane region. However, I don't think histone-binding or histone-avoiding motifs have such a strong pattern.

HMMER and MUSCLE, as a sequence searcher and an aligner, are designed to find and align evolutionarily related sequences or regions.

Such evolutionarily related regions are independent (I think) of any histone-avoiding motif. Therefore, the histone-avoiding motif will be corrupted in the resulting alignment. (And even if you successfully built an HMM of a histone-avoiding motif, the motif would be widely distributed in the genome, so the E-value might become very high. But I'm not sure; this is mere conjecture.)

But the idea is interesting.

The normal HMM build pipeline failed... The authors of the dataset you referred to are using k-gram + SVM? Then maybe one could construct a k-gram HMM? Or cluster related sequences and build a separate HMM for each cluster? Hmmmmmm... I think it requires further investigation of published papers (maybe someone has tried it) and very deep discussion.
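For reference, "k-gram" here means counting overlapping substrings of length k. The sketch below only illustrates that counting step; it is not the feature set used by the dataset's authors, and the function name and default `k=3` are assumptions:

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers (k-grams) in a sequence."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, `kmer_counts("ACGACG")` counts ACG twice and CGA and GAC once each; such counts could serve as a fixed-length feature vector for a classifier.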

I think you should discuss this with your supervisor or someone who knows this field well.

ADD REPLY • link 7.1 years ago by fishgolden ▴ 520