I finally solved this with the help of the hmmer devs https://github.com/EddyRivasLab/hmmer/issues/302#issuecomment-1539968198
Firstly you have to consider there is not known aminoacidic sequence longer than 100k aminoacids. The program that I mentioned in the post transeq
is not optimum for this case given the fact that it does not search for ORFs, it just translates a DNA sequence without looking for ORFs.
By recommendation of the mentioned dev, I instead used the esl-translate
program implemented in hmmer3. So this is what I did for a metagenomic sequencing reads set for one sample, additionally I added a 4th step to calculate abundance of genes along metagenomic samples:
1) Assembly the reads with MEGAHIT (or with your favorite assembler):
megahit -1 sample1_1.fastq.gz -2 sample1_2.fastq.gz --min-contig-len 300 -o output_dir
2) Retrieve ORFs from the contigs and translate them using esl-translate
. here I specify to use translation table 11 (dedicated for prokaryotes and Archaeal)
edit: as Mensur suggested below, prodigal is the best solution for this and not esl-translate as I specify here
esl-translate -c 11 final.contigs.fa > sample1_orfs.faa
3) Once you have your translated orfs, you can run hmmsearch with your profile file (ending with .hmm) and your translated ORFs:
hmmsearch --tblout sample1_results.txt --cpu `int` mygenes.hmm > sample1.log
4) Once you have your output tables from hmmsearch
you can calculate abundance of those genes by converting the table to a GTF-like table that can be the input file for featureCounts
program. First you can download a python file that I made for that purpose: https://github.com/Valentin-Bio/HMMER2SAF/tree/main
The scripts works in the directory where you have your hmmer output table files (sample1_results.txt) , and it will output a SAF file into the same directory
Once you have your SAF file , you can run featureCounts
to calculate the abundance of your genes of interest.
For this you will need a bam file of your reads aligned against the assembled contigs you made on step 1
featureCounts -p -O -T 32 -F SAF -a sample1_SAF.txt -o sample1_fcounts.txt sample1_sorted.bam
The output table will have information on the last column of the abundance of each gene on your sample, you can iterate these steps over multiple samples and parse those tables to get the abundance information column to create a unique table of samples * genes abundance table.
To get that table I used an own made R script and plotted it in an R stacked barplot.
It is nice that you posted the info to give this question a closure, but this is far from an optimal procedure.
esl-translate
is also not a gene finder - it simply translates sequences in all reading frame. Many of the peptides identified by this program are not really genes, which means that a lot of time is wasted on searching against them.A proper way is to process the (meta)genome with a gene finder. Something like this:
The meaning of all prodigal switches is described here.
yes thought on using prodigal before but it didn't worked on my machine, I will use it instead thanks!