HMM gets zero or 1 hits when many more expected
1
0
Entering edit mode
17 months ago
Gio • 0

Hi all,

My ultimate goal is to understand the phylogeny of a set of restriction-modification enzymes among certain genomes.

For this, I have done the following:

  1. Downloaded all RM genes DNA sequences into psych_rm_genes.fna from REBASE
  2. Cleaned rebase file with sed -E 's/([ACGT]) ([ACGT])/\1\2/g' psych_rm_genes.fna > psych_rm_genes_cleaned.fna then used vim to fix newlines and remove instances of the word fragment.

Optional:

  1. Clustered the DNA sequences with usearch -cluster_fast cleaned_psych_rm_genes.fna -id 0.90 -centroids rm_centroids.fna -uc rm_clusters.uc Note: Use muscle on rm_centroids.

  2. Aligned the DNA sequences with muscle ./muscle5 -super5 rm_genes.fna -output aligned_rm_genes.fasta

  3. Made an HMM using hmmbuild psych_rm.hmm aligned_rm_genes.fasta
  4. Built an anvio HMM directory anvio_psych_hmm/ using psych_rm.hmm
  5. Scanned the contains with the HMM by doing anvi-run-hmms -H anvio_psych_hmm/ -c psych_genomes.db -T 20

NCBI Genome Selection

  1. Ran search on REBASE for all bacteria akin to step 1 above.
  2. Copied GenBank (primary) into Psych_Accesions.txt one per line.
  3. Got genomes from NCBI epost -db nuccore -input Psych_Accessions.txt -format acc | efetch -format fasta > psych_genomes.fasta
  4. Cleaned the genomes using anvi-script-reformat-fasta psych_genomes.fasta -o cleaned_psych_genomes.fasta —simplify-defines —sea-type NT
  5. Made an anvio-db out of the genomes using anvi-gen-contigs-database -f cleaned_psych_genomes.fasta -o psych_genomes.db -T 20 --project-name psych_methylation

My understanding here is that I have downloaded gene sequences that exist in genomes, aligned them, built an HMM, searched genomes from which those genes were taken with that HMM and get 0 hits, or 1 hit if I cluster the gene sequences first. Why is this?

A previous conversation about this exists here.

anvio hmm hmmer phylogeny • 694 views
ADD COMMENT
2
Entering edit mode
17 months ago
Mensur Dlakic ★ 28k

HMMer programs are not primarily meant for DNA sequences. If you translate these RM genes into proteins before alignment, it is likely to work better. Same for the genomes you are searching against: predicting their genes with prodigal will give you a database of proteins that can be interrogated with protein-based HMMs.

ADD COMMENT
0
Entering edit mode

I indeed obtained over 200 hits just by switching over to protein space.

ADD REPLY

Login before adding your answer.

Traffic: 2138 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6