Hi,
I want to use OMA to identify orthologs across multiple species using CDS (DNA) sequences. I am wondering whether it is suitable for that purpose, since the manual mentions protein sequences and whole genomes as the input data.
I have downloaded the standalone version of OMA and tried to run it in batch mode, but I am getting this error message:
"I found an error in your parameter file. Most probably, you missed a semicolon at the end of a line."
The parameter file seems OK to me:
Type of input sequence data, has to be either 'DNA' or 'AA'
InputDataType := 'DNA';
Output folder
OutputFolder := 'Output';
If you want to recompute everything from scratch every time the script
is run, set the following parameter to false.
ReuseCachedResults := false;
Number of pairwise protein alignments done in one unit. The larger this
number, the longer each unit runs, and the fewer files get produced. This
allows adjusting the frequency of milestone steps (e.g. in case of a computer
crash).
AlignBatchSize := 1e6;
alignments which have a score lower than MinScore will not be considered.
The scores are in Gonnet PAM matrices units.
MinScore := 181;
Length tolerance ratio. If the length of the effective alignment is less
than LengthTol*min(length(s1),length(s2)) then the alignment is not
considered.
LengthTol := 0.61;
During the stable pair formation, if a pair has a distance provably higher
than another pair (i.e. StablePairTol standard deviations away) then it is
discarded.
StablePairTol := 1.81;
For the verification of stable pairs, there is also a tolerance parameter
(for details, see Dessimoz et al, Nucl Acids Res 2006)
VerifiedPairTol := 1.53;
Any sequence which is less than MinSeqLen amino acids long in regular
genomes is not considered.
MinSeqLen := 50;
Whether or not OMA should keep only one splicing variant per gene, i.e.
the one with the most homologous matches in all other species.
Annotation of splicing variants needs to be provided in a text file
DB/<genome>.splice
UseOnlyOneSplicingVariant := true;
use experimental code (single processor only) to compute homologous
clusters instead of full All-against-all.
UseExperimentalHomologousClusters := false;
Output parameters
Enables/disables the generation of stable identifiers for OMA groups (and
Hierarchical Groups if their computation is enabled). The identifier consists
of a prefix to determine the type of the group ('OMA' or 'HOG'), and a
subsequence of the amino acid sequence uniquely present in this group. The
computation of these ids might require a substantial amount of time. The ids
are stored in the OrthoXML files only.
StableIdsForGroups := false;
Enable/disable guessing of the id types while generating the orthoxml
file. In this context we refer to ID type guessing as the task of
guessing whether an ID should be stored in the geneId, protId or
transcriptId tag. If the flag is set to false, the whole fasta header
is used and stored as is in the protId tag.
GuessIdType := false;
Avoid producing some of the output files. This can reduce computing time
and especially avoids the generation of many files in large analyses. By
default all the output files are generated. Uncomment certain lines to
avoid the production of the corresponding output.
WriteOutput_PairwiseOrthologs := false;
WriteOutput_OrthologousPairs_orthoxml := false; #this file requires lots of time.
WriteOutput_OrthologousGroupsFasta := false;
WriteOutput_HOGFasta := false;
Hierarchical Orthologous Groups
Compute Hierarchical groups?
You can either set it to 'true', which will enable the computation or
disable it by setting it to 'false'. Keep in mind that the current
implementation is not yet parallelized and hence needs quite some time.
DoHierarchicalGroups := true;
Define the maximum amount of time (in sec) spent by the program for breaking
every connected component of the orthology graph at its weakest link on a
given taxonomic level. If set to a negative value, no time limit is enforced.
MaxTimePerLevel := 1200; # 20min
The hierarchical groups need a hierarchy of the involved species in the form of
a tree. This tree can either be estimated from the OMA Groups by setting the
SpeciesTree variable to 'estimate', or a (partially resolved) tree can be
given in Newick format. The estimation step again needs additional computing
time.
SpeciesTree := 'estimate';
SpeciesTree := '((mouse,mouse2),human,dog);';
The cutoff of 'average reachability within two steps' defines up to what
point a cluster is split into sub-clusters.
ReachabilityCutoff := 0.65;
ESPRIT -- Detection of split genes
Use Esprit?
You can either set this to 'true', which will enable esprit and shut down the
parts of OMA that are not directly needed for esprit, or set it to 'false' to
make no use of esprit at all.
UseEsprit := false;
NOTE: Genomes in which split genes are to be found should be called
"{unique name}.contig.fa". All other genomes are considered
reference genomes.
ESPRIT PARAMETERS
Confidence level variable for contigs (this is the parameter "tol"
described in the paper)
DistConfLevel := 2;
Min proportion of genomes with which contigs form many:1 BestMatches to
consider that we might be dealing with fragments of the same gene (this is
the parameter "MinRefGenomes" described in the paper, normalized by the
total number of reference genomes)
MinProbContig := 0.4;
Maximum overlap between fragments of the same gene from different contigs
MaxContigOverlap := 5;
Any sequence which is less than MinSeqLen amino acids long in contigs is not
considered.
MinSeqLenContig := 20;
Minimum score for BestMatch in scaffold recognition
MinBestScore := 250;
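
Since the error message points at a missing semicolon, a quick sanity check of the file might help narrow it down. The following is only a minimal sketch and not part of OMA: the function name `check_parameter_lines` is hypothetical, and it assumes that comments start with `#` and that every setting is a single-line `Name := value;` assignment, as in the file above. It does not handle a `#` inside a quoted value.

```python
def check_parameter_lines(lines):
    """Return (line_number, text, reason) tuples for suspicious lines.

    Assumptions (not taken from the OMA documentation): '#' starts a
    comment, and each setting is a one-line 'Name := value;' assignment.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        # Drop everything after the first '#', assumed to be a comment.
        code = raw.split("#", 1)[0].strip()
        if not code:
            continue  # blank line or pure comment
        if ":=" not in code:
            problems.append((number, code, "not an assignment and not a comment"))
        elif not code.endswith(";"):
            problems.append((number, code, "missing trailing semicolon"))
    return problems


# Small illustrative input; in practice the lines would be read from the
# parameter file itself.
sample = [
    "# Type of input sequence data",
    "InputDataType := 'DNA';",
    "Output folder",
    "OutputFolder := 'Output'",
    "MaxTimePerLevel := 1200; # 20min",
]

for number, text, reason in check_parameter_lines(sample):
    print(f"line {number}: {reason}: {text}")
```

With a check like this, any explanatory line that has lost its leading `#` is reported as well, because it is neither a comment nor an assignment.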