Question

Can HMMs detect partial protein sequences?

3.2 years ago

Rui ▴ 50

Hello everyone,

I am trying to determine the abundance of a gene in a set of metagenomic short reads using the HMM for this gene, and then compare its abundance across several samples after normalizing for data size. My plan is to first translate the reads into ORFs, then run the HMM against the translated reads. Since the HMM is longer (~220 amino acids) than a read (150 bases, which translates to ~50 amino acids), I am wondering whether an HMM search can accurately detect partial sequences and still report hits.

Thank you!

Rui
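To make the translation step in the question concrete, here is a minimal sketch of a six-frame translation of a single read in plain Python. It assumes the standard genetic code (NCBI table 1); the function names are hypothetical and not taken from any particular tool, and a dedicated ORF caller or translated-search program would normally do this for you.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), in the conventional
# TCAG codon ordering: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; ambiguous codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Map frame labels (+1..+3, -1..-3) to the translated peptide."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical 150-base read made of repeated ATG codons.
frames = six_frame_translations("ATG" * 50)
for label, peptide in frames.items():
    print(label, len(peptide))
```

A 150-base read gives 50 residues in the first frame of each strand and 49 in the other two, and any stop codon (`*`) inside a frame shortens the usable fragment further.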

metagenomics HMM • 1.7k views

score 3 · Answer 1 · 2021-09-05

3.2 years ago

Mensur Dlakic ★ 28k

If you use HMMER, the answer is yes: HMMER3 scores alignments locally, so a read only has to match part of the model. It would be a problem with glocal scoring, which expects the whole model to align. However, there is going to be less signal in 50 residues compared to the whole protein, so you may miss some divergent sequences. Also, some reads will cover fewer than 49-50 residues of the protein if they fall at either end of it, which may also leave them out.

Why not assemble instead and determine the abundance properly?
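The point about reads at either end of the protein can be put into rough numbers. The sketch below assumes a 660-base gene (220 amino acids), 150-base reads, and read start positions spread uniformly along the genome; it ignores reading frame, so `overlap // 3` is only an upper bound on the residues a read can contribute. The function names are hypothetical.

```python
def overlap_length_counts(gene_len_nt, read_len_nt):
    """Count read start positions by how many gene bases the read covers.

    Start positions are relative to the first base of the gene; only reads
    that overlap the gene by at least one base are considered.
    """
    counts = {}
    for start in range(-(read_len_nt - 1), gene_len_nt):
        end = start + read_len_nt
        overlap = min(end, gene_len_nt) - max(start, 0)
        counts[overlap] = counts.get(overlap, 0) + 1
    return counts


def fraction_with_at_least(gene_len_nt, read_len_nt, min_residues):
    """Fraction of overlapping reads that cover at least min_residues codons."""
    counts = overlap_length_counts(gene_len_nt, read_len_nt)
    total = sum(counts.values())
    kept = sum(n for overlap, n in counts.items() if overlap // 3 >= min_residues)
    return kept / total


# Assumed values from the question: ~220 aa model, 150-base reads.
for min_residues in (50, 40, 30):
    frac = fraction_with_at_least(660, 150, min_residues)
    print(f">= {min_residues} residues: {frac:.3f}")
```

Under these assumptions, only about 63% of the reads that touch the gene lie fully inside it; the rest carry shorter fragments and are the ones most likely to fall below the reporting threshold.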

Why not assemble instead <snip>

OP, take a look at PLASS. This is one of the applications it was designed for.

3.2 years ago by Dunois ★ 2.8k

Thanks, I will definitely look into it!

3.2 years ago by Rui ▴ 50

Thanks for the comments! I was just curious about how the results would differ between read-based and assembly-based analysis. I also assembled the reads with MEGAHIT; after retaining the contigs longer than 500 bases, I mapped the reads back onto the contigs with bowtie2, but the alignment rate was only about 30% (probably because my samples are derived from soil). So I thought that using only the contigs would waste a large chunk of the data. However, the problem is that even the reads cannot be guaranteed to capture the whole picture of the functional diversity in the environment. I think using assemblies would be a better idea in my case.
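Whichever route is taken, comparing samples needs the normalization for data size mentioned in the question. A minimal sketch, assuming simple scaling to hits per million reads; the sample names and counts are made up for illustration, and the 30% figure is the alignment rate reported above.

```python
def hits_per_million_reads(hit_count, total_reads):
    """Scale a raw hit count by sequencing depth (hits per million reads)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return hit_count * 1_000_000 / total_reads


def reads_represented(total_reads, alignment_rate):
    """Number of reads that map back to the contigs at a given alignment rate."""
    return total_reads * alignment_rate


# Hypothetical samples: raw HMM hit counts and total read counts.
samples = {
    "soil_A": {"hits": 120, "reads": 40_000_000},
    "soil_B": {"hits": 90, "reads": 15_000_000},
}

for name, s in samples.items():
    hpm = hits_per_million_reads(s["hits"], s["reads"])
    mapped = reads_represented(s["reads"], 0.30)
    print(f"{name}: {hpm:.1f} hits per million reads, "
          f"{mapped:,.0f} reads represented in contigs")
```

In this made-up example soil_B has fewer raw hits but twice the normalized abundance of soil_A, which is why raw counts should not be compared directly between samples of different depth.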

3.2 years ago by Rui ▴ 50

If you still care about the reads not included in the final assembly, you might want to have a look at the raw reads analysis pipeline v5.0 from MGnify. It does not involve any assembly step and is specifically designed for the functional annotation of raw sequencing data.

3.2 years ago by andres.firrincieli 3.8k

Thank you for the suggestion, I will look into it! I have tried similar read-based functional annotation tools before, such as HUMAnN 3, but the results were not satisfactory: ~90% of the reads could not be classified. While looking for a solution, I found this paper interesting, as it benchmarked various analysis approaches. It seems assemblies are preferred over reads when the alignment rates are high.

3.2 years ago by Rui ▴ 50