Our lab sequenced 7 genomes from the same organism; the divergence time between these genomes is less than 0.5 Mya.
About 42,000 orthologous gene families were constructed from these genomes.
To find positively selected genes, I used the codeml M2a vs. M1a likelihood ratio test (df = 2).
With an FDR cutoff of 0.01, I got ~16,000 genes (38%) that appear to be under positive selection.
Is codeml suitable for our dataset, given that the genes come from different strains of the same species?
The control file I specified is:
seqfile = input.seq
treefile = input.tree
outfile = mlc
noisy = 3
verbose = 1
runmode = 0
seqtype = 1
CodonFreq = 2
clock = 0
model = 1 2
NSsites = 2
icode = 0
fix_omega = 0
omega = .9
fix_kappa = 0
kappa = .3
cleandata = 1
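
For reference, the test-and-filter step described above (M2a vs. M1a LRT with df = 2, then an FDR cutoff of 0.01) can be sketched roughly as follows. This is a minimal sketch, not the exact script I ran: it assumes the FDR is controlled with Benjamini-Hochberg, the function names are made up for illustration, and the log-likelihood values are hypothetical placeholders rather than real results.

```python
import math

def lrt_pvalue_df2(lnl_null, lnl_alt):
    """P-value for the statistic 2*(lnL_alt - lnL_null) against chi-square, df = 2."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For df = 2 the chi-square survival function is exactly exp(-x/2),
    # so no statistics library is needed.
    return math.exp(-stat / 2.0)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical (lnL under M1a, lnL under M2a) per gene family; placeholders only.
lnl = {
    "fam1": (-1200.0, -1185.0),
    "fam2": (-980.5, -980.1),
    "fam3": (-1500.0, -1494.0),
}
names = list(lnl)
pvals = [lrt_pvalue_df2(*lnl[n]) for n in names]
qvals = benjamini_hochberg(pvals)
significant = [n for n, q in zip(names, qvals) if q < 0.01]
print(significant)
```

In practice the two lnL values per family would be parsed from each codeml output file instead of being typed in by hand.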
How did you obtain the individual CDS for the 14000 orthologs after assembling the genomes? I'm trying to do something similar with codeml, but I only know how to extract single genes one at a time. I recently obtained genome data, so I would like to extract all CDS and run codeml on all of them. Would you mind sharing a bit about how you process such a large number of genes? I use bwa mem for my assembly.