Question

HMM gets zero or 1 hits when many more expected

0

Entering edit mode

17 months ago

Gio • 0

Hi all,

My ultimate goal is to understand the phylogeny of a set of restriction-modification enzymes among certain genomes.

For this, I have done the following:

Downloaded all RM genes DNA sequences into psych_rm_genes.fna from REBASE
Cleaned rebase file with sed -E 's/([ACGT]) ([ACGT])/\1\2/g' psych_rm_genes.fna > psych_rm_genes_cleaned.fna then used vim to fix newlines and remove instances of the word fragment.

Optional:

Clustered the DNA sequences with usearch -cluster_fast cleaned_psych_rm_genes.fna -id 0.90 -centroids rm_centroids.fna -uc rm_clusters.uc Note: Use muscle on rm_centroids.
Aligned the DNA sequences with muscle ./muscle5 -super5 rm_genes.fna -output aligned_rm_genes.fasta
Made an HMM using hmmbuild psych_rm.hmm aligned_rm_genes.fasta
Built an anvio HMM directory anvio_psych_hmm/ using psych_rm.hmm
Scanned the contains with the HMM by doing anvi-run-hmms -H anvio_psych_hmm/ -c psych_genomes.db -T 20

NCBI Genome Selection

Ran search on REBASE for all bacteria akin to step 1 above.
Copied GenBank (primary) into Psych_Accesions.txt one per line.
Got genomes from NCBI epost -db nuccore -input Psych_Accessions.txt -format acc | efetch -format fasta > psych_genomes.fasta
Cleaned the genomes using anvi-script-reformat-fasta psych_genomes.fasta -o cleaned_psych_genomes.fasta —simplify-defines —sea-type NT
Made an anvio-db out of the genomes using anvi-gen-contigs-database -f cleaned_psych_genomes.fasta -o psych_genomes.db -T 20 --project-name psych_methylation

My understanding here is that I have downloaded gene sequences that exist in genomes, aligned them, built an HMM, searched genomes from which those genes were taken with that HMM and get 0 hits, or 1 hit if I cluster the gene sequences first. Why is this?

A previous conversation about this exists here.

anvio hmm hmmer phylogeny • 693 views

ADD COMMENT • link 17 months ago by Gio • 0

score 2 · Accepted Answer · 2023-07-16

2

Entering edit mode

17 months ago

Mensur Dlakic ★ 28k

HMMer programs are not primarily meant for DNA sequences. If you translate these RM genes into proteins before alignment, it is likely to work better. Same for the genomes you are searching against: predicting their genes with prodigal will give you a database of proteins that can be interrogated with protein-based HMMs.