Hi all,
My ultimate goal is to understand the phylogeny of a set of restriction-modification enzymes among certain genomes.
For this, I have done the following:
- Downloaded all RM genes DNA sequences into
psych_rm_genes.fna
from REBASE - Cleaned rebase file with
sed -E 's/([ACGT]) ([ACGT])/\1\2/g' psych_rm_genes.fna > psych_rm_genes_cleaned.fna
then usedvim
to fix newlines and remove instances of the word fragment.
Optional:
Clustered the DNA sequences with
usearch -cluster_fast cleaned_psych_rm_genes.fna -id 0.90 -centroids rm_centroids.fna -uc rm_clusters.uc
Note: Usemuscle
onrm_centroids
.Aligned the DNA sequences with
muscle ./muscle5 -super5 rm_genes.fna -output aligned_rm_genes.fasta
- Made an HMM using
hmmbuild psych_rm.hmm aligned_rm_genes.fasta
- Built an anvio HMM directory
anvio_psych_hmm/
usingpsych_rm.hmm
- Scanned the contains with the HMM by doing
anvi-run-hmms -H anvio_psych_hmm/ -c psych_genomes.db -T 20
NCBI Genome Selection
- Ran search on REBASE for all bacteria akin to step 1 above.
- Copied GenBank (primary) into
Psych_Accesions.txt
one per line. - Got genomes from NCBI
epost -db nuccore -input Psych_Accessions.txt -format acc | efetch -format fasta > psych_genomes.fasta
- Cleaned the genomes using
anvi-script-reformat-fasta psych_genomes.fasta -o cleaned_psych_genomes.fasta —simplify-defines —sea-type NT
- Made an anvio-db out of the genomes using
anvi-gen-contigs-database -f cleaned_psych_genomes.fasta -o psych_genomes.db -T 20 --project-name psych_methylation
My understanding here is that I have downloaded gene sequences that exist in genomes, aligned them, built an HMM, searched genomes from which those genes were taken with that HMM and get 0 hits, or 1 hit if I cluster the gene sequences first. Why is this?
A previous conversation about this exists here.
I indeed obtained over 200 hits just by switching over to protein space.