I'm trying to find and use a popular gene-finding tool called genBlastG 1.39 (She et al. 2011), but the source website seems to be no longer available and gives me 404 errors. It appears not to have been updated for 7 years. I have tried downloading the source code on Linux using:
but I still get the 404 error. Yet it has been cited by recent genomics papers (e.g. Navarro-Escalante et al. 2021, Tan et al. 2021), so clearly someone is getting it to work. Does anyone have any ideas on how to use this program?
Or if anyone has newer alternatives, I'm open to that. I see Keilwagen et al. 2018 have created GeMoMa, so I may look into that. The newest OrthoFinder (Emms & Kelly 2019) seems to have some of the same functionality for categorizing genes into orthogroups, but I would like to compare the OrthoFinder results with genBlastG, since some papers use genBlastG instead of or in addition to OrthoFinder.
Thank you both! I'm trying it now. It seems the default installation with bioconda is version 1.38, whereas 1.39 gives me an incompatibility "UnsatisfiableError" with libstdcxx-ng. But perhaps I can work around it...
As a follow-up, GenoMax and Mensur Dlakic, I was able to install both genblast packages with bioconda (thank you!) but I can't figure out how to execute them. I keep getting the dreaded "command not found" error for either one, despite adding the miniconda3 path, editing .bashrc, etc. This seems to be the problem someone experienced on this thread. Someone suggested activating the conda environment, but genblasta or genblastg aren't environments. (I tried it anyway, of course.) Any ideas on why this may be happening? I keep thinking I've set up conda wrong, but I was able to install and use GeMoMa via bioconda. Thanks again for any advice!
Thanks for the reply! I did activate my conda environment and then installed genblastg. I called the environment something else, but I don't see why that should matter. I'll see if this works...
Unless the named environment (where you installed the program) is active, you will not be able to run the installed program. Running conda activate on its own only activates the base environment.
Thanks! Yes, I had created a conda environment called opencv, activated opencv, and installed the genblast programs therein, but they would not run. I also tried another environment. It did not work until I used the genblast environment name that you and Mensur Dlakic recommended. I don't understand it, but it only worked for that environment.
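A quick way to see whether the executables are visible from whichever environment is currently active is to ask the shell where it would find them. This is only a sketch: check_tools is a made-up helper name (it is not part of conda or genBlast), and the executable names genblasta and genblastg are simply the ones used in this thread.

```shell
# Hypothetical helper: report whether each named command resolves on the
# current PATH. Returns 0 only if every command was found.
check_tools() {
    status=0
    for tool in "$@"; do
        if location=$(command -v "$tool" 2>/dev/null); then
            printf '%s\tfound\t%s\n' "$tool" "$location"
        else
            printf '%s\tmissing\n' "$tool"
            status=1
        fi
    done
    return "$status"
}
```

After activating the environment where the packages were installed, running check_tools genblasta genblastg should report a path for each one; "missing" means the shell still cannot see it from that environment.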
genBlastA release v1.0.1
SYNOPSIS:
Given a list of query protein or DNA sequences and a target database that
consists of DNA sequences, this program runs wu-blast tblastn on the list
of sequences provided, then for each query, it groups the resulted HSPs
into sensible groups so that each group of HSPs corresponds to a potential
target gene that is homologous to the query. The output is ranked according
to their homology to the query.
Command line options:
    -P    Search program used to produce blast-format sequence alignments,
          can be either "blast" or "wublast", default is "blast",
          optional
    -q    List of query sequences to blast, must be in fasta format,
          required
    -t    The target database of genomic sequences in fasta format,
          required
    -p    Whether query sequences are protein sequences (T/F)
          [default: T], optional
    -pg   Specify which blast/wublast program to run. If not specified,
          the default behaviour is to run tblastn (for blast/wublast protein
          sequence) / blastn (for blast nucleotide sequence) or tblastx
          (for wublast nucleotide sequence).
    -e    parameter for blast: The e-value, [default: 1e-2],
          optional
    -g    parameter for blast: Perform gapped alignment (T/F)
          [default: T], optional
    -f    parameter for blast: Perform filtering (T/F) [default: F],
          optional
    -a    parameter for genBlast: weight of penalty for skipping HSPs,
          between 0 and 1 [default: 0.5], optional
    -d    parameter for genBlast: maximum allowed distance between HSPs
          within the same gene, a non-negative integer [default: 100000],
          optional
    -r    parameter for genBlast: number of ranks in the output,
          a positive integer, optional
    -c    parameter for genBlast: minimum percentage of query gene
          coverage in the output, between 0 and 1 (e.g. for 50%
          gene coverage, use "0.5"), optional
    -s    parameter for genBlast: minimum score of the HSP group in
          the output, a real number, optional
    -o    output filename, optional. If not specified, the output
          will be the same as the query filename with ".gblast"
          extension.
Example:
genblasta -P blast -pg tblastn -q myquery -t mytarget -p T -e 1e-2 -g T -f F -a 0.5 -d 100000 -r 10 -c 0.5 -s 0 -o myoutput
(Rong She (rshe@cs.sfu.ca) May 2010)
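The example above can be wrapped so that missing input files are caught before BLAST starts. The sketch below only prints the command for review rather than running it. genblasta_cmd is a hypothetical helper name; the flags are copied from the help text above, and the fixed values (-e 1e-2, -c 0.5, -r 10) are just the ones from the example, not recommendations.

```shell
# Hypothetical helper: check the inputs, then print a genblasta command line
# built from the flags documented in the help text.
genblasta_cmd() {
    if [ "$#" -ne 3 ]; then
        echo "usage: genblasta_cmd QUERY_FASTA TARGET_FASTA OUTPUT" >&2
        return 2
    fi
    query=$1
    target=$2
    output=$3
    # Both -q and -t are marked "required" and must be fasta files.
    for f in "$query" "$target"; do
        if [ ! -r "$f" ]; then
            echo "cannot read input file: $f" >&2
            return 1
        fi
    done
    printf 'genblasta -P blast -pg tblastn -q %s -t %s -p T -e 1e-2 -c 0.5 -r 10 -o %s\n' \
        "$query" "$target" "$output"
}
```

For example, genblasta_cmd myquery mytarget myoutput prints a command you can inspect and then run by hand. Paths containing spaces are printed unquoted, so quote them yourself before running.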
I'm very grateful to both you and GenoMax for your helpful advice! The programs appear to be functional now. The only issue I now have is that I get an error that says
sh: 1: ./formatdb: not found
XDF file error
So it seems that it requires a blast database, even though the documentation doesn't mention this input. If I specify wublast, the error says ./xdformat: not found instead. I wonder if I could use a .blastxml search database -- presumably for the target species not the query sequences from the reference species.
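The ./ in front of formatdb suggests the program is looking for the BLAST helper in the current working directory rather than on PATH. If legacy BLAST is already installed and on PATH, one thing to try is linking the helpers into the directory you run from. This is a guess based on the error text, not documented genBlast behaviour, and link_into_cwd is a hypothetical helper name.

```shell
# Hypothetical helper: symlink each named command from PATH into the
# current directory, so that "./name" resolves.
link_into_cwd() {
    for tool in "$@"; do
        if ! source_path=$(command -v "$tool" 2>/dev/null); then
            echo "not on PATH: $tool" >&2
            return 1
        fi
        ln -sf "$source_path" "./$tool"
    done
}
```

With the blast setting, that would be something like link_into_cwd formatdb blastall; with wublast, the error message points at xdformat instead.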
In case anyone ends up needing genBlastG, I found a solution. Seven years ago, Michael Paulini posted the genblastg_patch for WormBase on GitHub. Whenever I tried to use this version or the versions available on conda, I got an error saying it could not find the blastall or formatdb files, or that genblastG was not a viable command. Last year, Guisen Chen posted a Python version called genblastG_extension on GitHub. It also failed, in this case because it was missing the alignscore.txt file, but it does include the files that the other versions could not find. So I copied the files contained in genblastg_patch (which included alignscore.txt) into the genblastG_extension directory, and executed it without Python, using the same syntax shown by Mensur Dlakic above. And it works!
Obviously this wouldn't be necessary if the original package had been maintained. The lack of maintenance is likely a consequence of transient workers like grad students or postdocs creating something and never managing it thereafter. This may also be the case with a new package called TGFAM-Finder, which is likewise meant to be a homology-based gene finder: every time I tried to install it, I got an error saying "resource temporarily unavailable." It would be ideal to use GeMoMa for homology-based gene finding, but there too I got errors saying "there are gene annotations on chromosomes/contigs with missing reference sequences ..." and "Did not finish as intended." But GeMoMa is mainly developed for whole-genome annotation anyway, with no tutorials on gene family analysis. The growing need to compare gene families among already-annotated genomes and newly annotated ones will probably lead to new packages in the coming years, so that hacking a deprecated one won't be necessary.
Glad to let you all know that there is a new tool from Heng Li: miniprot. You will like it.
By the way, is anyone thinking about revising genblastA so that it takes in output from blast instead of running blast internally? We could then run blast in parallel ourselves, which would be much, much faster!
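Until genblastA can read precomputed blast results, the parallel part can at least be prepared outside it by splitting the query file into chunks that separate blast jobs could take. A minimal sketch, assuming a multi-record fasta query file; split_fasta and the chunk file naming are made up for illustration, and feeding the resulting blast output back into genblastA would still need the change asked about above.

```shell
# Hypothetical helper: distribute fasta records round-robin over N chunk
# files named PREFIX.1.fa ... PREFIX.N.fa.
split_fasta() {
    input=$1
    chunks=$2
    prefix=$3
    awk -v n="$chunks" -v prefix="$prefix" '
        # A header line starts a new record; pick the chunk it goes to.
        /^>/ { out = sprintf("%s.%d.fa", prefix, (records % n) + 1); records++ }
        # Write the header and its sequence lines to the chosen chunk.
        out != "" { print > out }
    ' "$input"
}
```

For example, split_fasta myquery 4 chunk writes chunk.1.fa to chunk.4.fa, each of which could then be searched by its own blast process.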
If you had simply done
then you need to
conda activate
(i.e. activate the base environment). At this point you should be able to find the executable.
Ideally you should have created a new environment
To create an environment:
Activate:
Then type genblastA or genblastG, as needed.