Hello everyone,
I am trying to determine the abundance of a gene in a set of metagenomic short reads using an HMM of this gene, and then use this method to compare the abundance of the gene across several samples after normalizing for data size. My plan is to first translate the reads into ORFs, then run the HMM against them. Since the HMM is longer (~220 amino acids) than a translated read (150 bases, which translates to ~50 amino acids), I am wondering whether an HMM search can accurately detect the partial sequences and still report hits?
Thank you!
Rui
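For reference, the pre-processing described in the question (translating each read in all six frames, then normalizing hit counts for data size) could be sketched roughly as below. This is only an illustration under assumptions: the function names (`six_frame`, `hits_per_million`) are made up for this example, the standard genetic code is assumed, normalization is taken to mean "hits per million reads", and the HMM search itself is not included.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(read):
    """Hypothetical helper: translate a read in 3 forward and 3 reverse frames."""
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def hits_per_million(hit_reads, total_reads):
    """Assumed normalization: reads with a hit per million reads in the sample."""
    return hit_reads * 1_000_000 / total_reads


# A 150-base read yields peptides of at most 50 residues per frame.
print([len(p) for p in six_frame("A" * 150)])  # [50, 49, 49, 50, 49, 49]
```

Each translated frame covers at most about a quarter of a ~220 amino acid model, which is the length mismatch the question is about.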
OP, take a look at PLASS. This is one of the applications it was designed for.
Thanks, I will surely take a look into it!
Thanks for the comments! I was just curious about how the results would differ between read-based and assembly-based analysis. I also assembled the reads with MEGAHIT and, after retaining the contigs longer than 500 bases, mapped the reads back onto the contigs with bowtie2, but the alignment rate is only about 30% (probably because my samples are derived from soil). So I thought that using only the contigs would waste a large chunk of the data. However, the problem is that even the reads cannot be guaranteed to capture the whole picture of the functional diversity in the environment. I think using assemblies would be a better idea in my case.
If you still care about the reads not included in the final assembly, you might want to have a look at the raw reads analysis pipeline v5.0 from MGnify. It does not involve any assembly step, and it is specifically designed for the functional annotation of raw sequencing data.
Thank you for the suggestion, I will take a look into it! I've tried similar read-based tools for functional annotation before, such as HUMAnN 3, but the results were not satisfactory: ~90% of the reads could not be classified. While looking for a solution, I found this paper interesting as it benchmarked various analysis approaches. It seems assemblies are preferred over reads when the alignment rates are high.